Overview

The LAN agent’s WebSocket video stream embeds AI detection results directly into the H.264 encapsulation header. When the on-camera inference pipeline produces a new detection, it is spliced as a TLV field into the next outgoing video frame on the same WebSocket — no separate detection channel. This guide covers:
  • The TLV encapsulation format and the JSON schema inside the AI_DETECTIONS field
  • Connecting to the live LAN H.264 WebSocket and reading both binary frames and the text init message
  • Drawing detection boxes on top of RhombusRealtimePlayer using a parallel detection-only WebSocket
  • A from-scratch parser reference for non-React consumers
If you only need a player, embed RhombusRealtimePlayer — it handles auth, WebCodecs decoding, and resolution negotiation. This guide is for adding a detection-overlay layer on top, or for clients that don’t use the React SDK.

Connecting to the LAN realtime stream

Get the WebSocket URL

Call POST /api/camera/getMediaUris and read:
  • lanLiveH264Uris (array of strings) — LAN URLs, when the client and camera share a network
  • wanLiveH264Uri (string) — WAN URL, routed through Rhombus
For the lower-resolution variant, swap /ws for /wsl in the path.
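A minimal backend sketch of that call, assuming the standard https://api2.rhombussystems.com API base, the usual x-auth-scheme/x-auth-apikey headers, and a cameraUuid body field (verify these against the API reference for your org):
  // Backend-only sketch: resolve media URIs for a camera. The base URL,
  // header names, and cameraUuid body field are assumptions based on the
  // standard Rhombus API request shape.
  const cameraUuid = "CAMERA_UUID"; // target camera

  const res = await fetch("https://api2.rhombussystems.com/api/camera/getMediaUris", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-auth-scheme": "api-token",
      "x-auth-apikey": process.env.RHOMBUS_API_KEY!,
    },
    body: JSON.stringify({ cameraUuid }),
  });
  const { lanLiveH264Uris, wanLiveH264Uri } = await res.json();

  // Lower-resolution variant: swap /ws for /wsl in the chosen URL.
  const lowResUrl = wanLiveH264Uri.replace("/ws", "/wsl");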

Authenticate

Both modes use a federated session token minted on your backend via POST /api/org/generateFederatedSessionToken. Never put your API key in browser code.
Mode | Auth method
WAN | Append ?x-auth-scheme=federated-token&x-auth-ft=<TOKEN> to the URL before opening the WebSocket.
LAN | Set an RFT=<TOKEN> cookie scoped to the camera’s domain before opening the WebSocket.
The full token-minting backend example (Express, FastAPI, Next.js) lives in the React SDK guide — reuse it.
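In code, the two modes look roughly like this (helper names are illustrative; the query parameters and cookie name come from the table above):
  // WAN: append the federated token as query parameters before connecting.
  function wanUrlWithAuth(wanLiveH264Uri: string, token: string): string {
    const url = new URL(wanLiveH264Uri);
    url.searchParams.set("x-auth-scheme", "federated-token");
    url.searchParams.set("x-auth-ft", token);
    return url.toString();
  }

  // LAN: set the RFT cookie for the camera's domain before connecting.
  // The browser can only set cookies for the page's own domain, so this
  // runs in a page served from (or proxied through) that domain.
  function setLanAuthCookie(token: string, cameraDomain: string): void {
    document.cookie = `RFT=${token}; domain=${cameraDomain}; path=/; secure`;
  }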

What the server sends

Immediately after the WebSocket upgrade and before any binary frames, the server sends a single text message describing the stream:
{"action":"init","width":1920,"height":1080,"codec":"h264","framerate":15}
Read the dimensions if your renderer needs the source resolution. Bounding boxes are resolution-independent (permyriad units), so most overlays don’t need this. After the init message, every subsequent message is a binary frame containing the TLV-encoded encapsulation header followed by raw H.264 NAL data.

Encapsulation header (TLV format)

Each binary message contains a sequence of TLVs. Every TLV uses the same wire format:
[1 byte type] [3 bytes length, big-endian] [N bytes value]
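For illustration, a hypothetical writer-side helper makes the layout concrete:
  // Hypothetical encoder for the wire format above: 1-byte type,
  // 3-byte big-endian length, then the value bytes.
  function encodeTlv(type: number, value: Uint8Array): Uint8Array {
    const out = new Uint8Array(4 + value.length);
    out[0] = type;
    out[1] = (value.length >>> 16) & 0xff; // length, most significant byte
    out[2] = (value.length >>> 8) & 0xff;
    out[3] = value.length & 0xff;          // length, least significant byte
    out.set(value, 4);
    return out;
  }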

TLV types

Type | Name | Value | Notes
0x00 | SPS_PPS_IFRAME | H.264 NAL data | Keyframe (SPS/PPS/I-frame). Always last TLV in the message.
0x01 | NON_IFRAME | H.264 NAL data | Delta frame (P/B). Always last TLV in the message.
0x02 | TIMESTAMP | 8-byte uint64 BE | Server wall-clock time in milliseconds.
0x03 | PTS_US | 8-byte uint64 BE | Third-party PTS in microseconds. Optional; used for B-frame reordering.
0x04 | AI_DETECTIONS | UTF-8 JSON string | New AI detections. Present only when a new inference result is available. Not null-terminated; use the length field.

Wire layout

┌──────────────────────────────────────────────────┐
│ TIMESTAMP (0x02): 4 + 8 = 12 bytes               │  always present
├──────────────────────────────────────────────────┤
│ PTS_US (0x03): 4 + 8 = 12 bytes                  │  optional
├──────────────────────────────────────────────────┤
│ AI_DETECTIONS (0x04): 4 + N bytes                │  only when a new
│                                                  │  detection is available
├──────────────────────────────────────────────────┤
│ frame-data (0x00 or 0x01): 4 + N bytes           │  always last;
│                                                  │  value is raw H.264
└──────────────────────────────────────────────────┘
The frame-data TLV (0x00 or 0x01) is always the last entry — the LAN agent’s encoder explicitly inserts metadata TLVs ahead of the frame entry. A safe parser stops walking TLVs once it encounters a frame-data type.

Parsing the encapsulation header

Walk TLV fields until you hit type 0x00 or 0x01 (the frame-data entry):
type ParsedFrame = {
  timestampMs: number | null;
  ptsUs: number | null;
  detectionJson: string | null;
  isKeyframe: boolean;
  h264Data: Uint8Array;
};

export function parseEncapHeader(buffer: ArrayBuffer): ParsedFrame {
  const view = new DataView(buffer);
  const bytes = new Uint8Array(buffer);
  let offset = 0;
  let timestampMs: number | null = null;
  let ptsUs: number | null = null;
  let detectionJson: string | null = null;

  while (offset + 4 <= buffer.byteLength) {
    const type = view.getUint8(offset);
    const len =
      (view.getUint8(offset + 1) << 16) |
      (view.getUint8(offset + 2) << 8) |
      view.getUint8(offset + 3);
    const valueStart = offset + 4;

    if (type === 0x00 || type === 0x01) {
      return {
        timestampMs,
        ptsUs,
        detectionJson,
        isKeyframe: type === 0x00,
        h264Data: bytes.subarray(valueStart, valueStart + len),
      };
    }

    if (type === 0x02) {
      const hi = view.getUint32(valueStart);
      const lo = view.getUint32(valueStart + 4);
      timestampMs = hi * 0x100000000 + lo;
    } else if (type === 0x03) {
      const hi = view.getUint32(valueStart);
      const lo = view.getUint32(valueStart + 4);
      ptsUs = hi * 0x100000000 + lo;
    } else if (type === 0x04) {
      detectionJson = new TextDecoder().decode(
        bytes.subarray(valueStart, valueStart + len)
      );
    }
    // Unknown types are skipped silently.

    offset = valueStart + len;
  }

  throw new Error("Encapsulation header missing frame-data TLV");
}
The Rhombus React SDK uses an equivalent parser at parseRhombusH264Binary.ts — the canonical client-side reference.

Detection JSON schema

AI_DETECTIONS carries a JSON array of detection objects. All detections from a single inference share the same ts.

Required fields

Field | Type | Units | Description
t | int | enum | Detection type. 0 Human, 1 Vehicle, 2 Face, 3 License Plate (LPR), 4 Pose, 5 CLIP Embedding
c | int | permyriad (0–10000) | Confidence. Divide by 100 for percent.
id | int | - | Tracker object id. Stable across frames for the same tracked object.
b | int[4] | permyriad | Bounding box [left, top, right, bottom]
ts | int | ms epoch | Frame timestamp the AI pipeline analyzed. Use this for frame-accurate alignment.
uuid | string | RUUID | Parent event UUID
rs | float | seconds | Relative-second timestamp within the event

Optional fields

Field | Type | Notes
clr | object | Color histogram. Keys are color names (e.g. "red", "blue"); values are permyriad.
tight_crop_xxyy | int[4] | Tight bbox within the detection’s crop window: [x_min, x_max, y_min, y_max] (permyriad). Useful when the consumer wants a tighter box than the padded detection window.
ec | int | Embedding confidence (permyriad), present when an embedding is computed.
et | string | Embedding type identifier.
e | string | Embedding vector (string-encoded; length depends on type).
il | string | Image-locator reference for the detection’s crop.
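For typed consumers, the two tables map onto a small interface (a sketch, not part of the SDK):
  // Detection object carried in AI_DETECTIONS; field names from the tables above.
  export interface Detection {
    t: number;    // detection type enum: 0 Human … 5 CLIP Embedding
    c: number;    // confidence, permyriad (0–10000)
    id: number;   // tracker object id, stable across frames
    b: [number, number, number, number]; // bbox [left, top, right, bottom], permyriad
    ts: number;   // analyzed-frame timestamp, ms epoch
    uuid: string; // parent event RUUID
    rs: number;   // relative seconds within the event
    clr?: Record<string, number>; // color histogram, permyriad values
    tight_crop_xxyy?: [number, number, number, number]; // [x_min, x_max, y_min, y_max]
    ec?: number;  // embedding confidence, permyriad
    et?: string;  // embedding type identifier
    e?: string;   // string-encoded embedding vector
    il?: string;  // image-locator reference
    [key: string]: unknown; // forward compatibility: tolerate unknown keys
  }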

Example

[
  {
    "t": 0,
    "c": 8500,
    "id": 3,
    "b": [1200, 3400, 4500, 8900],
    "ts": 1715030400000,
    "uuid": "AAAAAAAAAAAAAAAAAAAAAA",
    "rs": 2.5,
    "clr": {"red": 4000, "blue": 6000},
    "tight_crop_xxyy": [1500, 4200, 3700, 8500]
  }
]
Forward-compatible parsing. Future firmware releases will add LPR text (lp_chars, lp_confidence), pose skeletons (pose_permyriad_points — 38-joint, not the 17-joint COCO set), and re-identification embeddings. Treat all unrecognized fields as optional and ignore unknown keys, so your client keeps working when those fields land.
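A minimal forward-compatible parse under that rule validates only the required fields and lets unknown keys ride along untouched (parseDetections is a hypothetical helper using the Detection interface sketched above):
  // Validate required fields only; unknown keys pass through for future firmware.
  export function parseDetections(json: string): Detection[] {
    const raw: unknown = JSON.parse(json);
    if (!Array.isArray(raw)) return [];
    return raw.filter(
      (d): d is Detection =>
        typeof d?.t === "number" &&
        typeof d?.c === "number" &&
        Array.isArray(d?.b) &&
        d.b.length === 4
    );
  }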

Drawing bounding boxes on a canvas

Bounding box coordinates are permyriad (0–10000) and resolution-independent. Convert to pixels using the canvas dimensions:
const TYPE_COLORS = {
  0: "#00ff00", // Human    — green
  1: "#0088ff", // Vehicle  — blue
  2: "#ff00ff", // Face     — magenta
  3: "#ffff00", // LPR      — yellow
  4: "#00ffff", // Pose     — cyan
  5: "#ff8800", // CLIP     — orange
};

const TYPE_LABELS = ["Human", "Vehicle", "Face", "LPR", "Pose", "CLIP"];

export function drawDetections(ctx, canvasWidth, canvasHeight, detections) {
  ctx.clearRect(0, 0, canvasWidth, canvasHeight);

  for (const det of detections) {
    const [left, top, right, bottom] = det.b;
    const x = (left / 10000) * canvasWidth;
    const y = (top / 10000) * canvasHeight;
    const w = ((right - left) / 10000) * canvasWidth;
    const h = ((bottom - top) / 10000) * canvasHeight;

    ctx.strokeStyle = TYPE_COLORS[det.t] ?? "#ffffff";
    ctx.lineWidth = 2;
    ctx.strokeRect(x, y, w, h);

    const conf = Math.round(det.c / 100);
    const label = `${TYPE_LABELS[det.t] ?? "Unknown"} ${conf}% #${det.id}`;
    ctx.fillStyle = ctx.strokeStyle;
    ctx.font = "12px monospace";
    ctx.fillText(label, x, Math.max(10, y - 4));
  }
}
For a 1280×720 canvas and b: [1200, 3400, 4500, 8900], this yields (x=153.6, y=244.8, w=422.4, h=396.0).

Timing behavior

  • Detections are not present on every frame. The AI pipeline analyzes a subset of frames (typically 2–10 fps). Most frames carry no AI_DETECTIONS TLV.
  • det.ts may precede the carrier frame’s TIMESTAMP by up to ~250 ms because the inference pipeline and the encoder run independently — the new detection rides whatever frame happens to leave the encoder next. Align overlays on det.ts, not on the enclosing frame’s TIMESTAMP, especially for VOD or buffered playback; see the alignment sketch after this list.
  • Persist between updates. To keep boxes visible between detection updates, hold the most recent set and keep redrawing it until a newer set arrives or a TTL elapses. A 2-second TTL is a safe default.
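For buffered or VOD-style playback, one way to honor det.ts is to buffer detection sets keyed by timestamp and look up the newest set at or before each rendered frame. A sketch, assuming the Detection interface from the schema section:
  // Buffer recent detection sets by their ts (all detections in one
  // inference share the same ts), then align to frame presentation time.
  const detectionBuffer = new Map<number, Detection[]>(); // ts (ms) -> set

  function onDetections(dets: Detection[]): void {
    if (dets.length === 0) return;
    detectionBuffer.set(dets[0].ts, dets);
    // Bound memory: drop sets older than a few seconds.
    for (const ts of detectionBuffer.keys()) {
      if (dets[0].ts - ts > 5000) detectionBuffer.delete(ts);
    }
  }

  // Newest detection set at or before the frame's presentation time.
  function detectionsForFrame(frameTsMs: number): Detection[] {
    let bestTs = -Infinity;
    for (const ts of detectionBuffer.keys()) {
      if (ts <= frameTsMs && ts > bestTs) bestTs = ts;
    }
    return detectionBuffer.get(bestTs) ?? [];
  }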

Extending RhombusRealtimePlayer with detection rendering

The React SDK’s RhombusRealtimePlayer doesn’t currently surface AI detections to the host application. Until it does, the simplest pattern is to open a second WebSocket to the same URL purely to read AI_DETECTIONS, and draw the result on a <canvas> overlaid on the player.
A parallel WebSocket doubles the egress for that camera. Use it only on the page that needs detections, and close it on unmount.
import { RhombusRealtimePlayer } from "@rhombussystems/react";
import { useEffect, useRef, useState } from "react";
import { parseEncapHeader } from "./parseEncapHeader"; // from earlier in this guide
import { drawDetections } from "./drawDetections";     // from earlier in this guide

type Props = {
  cameraUuid: string;
  /** WebSocket URL for this camera (from getMediaUris + auth append) */
  detectionWsUrl: string;
  /** Optional: how long to keep boxes after the last update */
  detectionTtlMs?: number;
};

export function RhombusRealtimePlayerWithDetections({
  cameraUuid,
  detectionWsUrl,
  detectionTtlMs = 2000,
}: Props) {
  const canvasRef = useRef<HTMLCanvasElement>(null);
  const detectionsRef = useRef<any[]>([]);
  const lastTsRef = useRef(0);
  const [size, setSize] = useState({ width: 1920, height: 1080 });

  useEffect(() => {
    const ws = new WebSocket(detectionWsUrl);
    ws.binaryType = "arraybuffer";

    ws.onmessage = (event) => {
      // The server sends one text init message before any binary frames.
      if (typeof event.data === "string") {
        try {
          const init = JSON.parse(event.data);
          if (init.action === "init" && init.width && init.height) {
            setSize({ width: init.width, height: init.height });
          }
        } catch {
          // Ignore non-JSON text messages.
        }
        return;
      }

      try {
        const { detectionJson } = parseEncapHeader(event.data);
        if (!detectionJson) return;
        const dets = JSON.parse(detectionJson);
        detectionsRef.current = dets;
        lastTsRef.current = Date.now();
      } catch (err) {
        console.warn("Failed to parse encap header", err);
      }
    };

    return () => ws.close();
  }, [detectionWsUrl]);

  useEffect(() => {
    let raf = 0;

    const tick = () => {
      const canvas = canvasRef.current;
      if (canvas) {
        const ctx = canvas.getContext("2d");
        if (ctx) {
          const fresh = Date.now() - lastTsRef.current < detectionTtlMs;
          drawDetections(
            ctx,
            canvas.width,
            canvas.height,
            fresh ? detectionsRef.current : []
          );
        }
      }
      raf = requestAnimationFrame(tick);
    };

    raf = requestAnimationFrame(tick);
    return () => cancelAnimationFrame(raf);
  }, [detectionTtlMs]);

  return (
    <div style={{ position: "relative", width: size.width, height: size.height }}>
      <RhombusRealtimePlayer
        cameraUuid={cameraUuid}
        connectionMode="wan"
      />
      <canvas
        ref={canvasRef}
        width={size.width}
        height={size.height}
        style={{
          position: "absolute",
          inset: 0,
          width: "100%",
          height: "100%",
          pointerEvents: "none",
        }}
      />
    </div>
  );
}
Resolve detectionWsUrl on your backend the same way the SDK does: call getMediaUris, pick the appropriate wanLiveH264Uri or LAN entry, then append ?x-auth-scheme=federated-token&x-auth-ft=<TOKEN> (WAN) or set the RFT cookie (LAN) before passing the URL to the browser.
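A compact backend sketch of that flow, composing the helpers from earlier sections; the durationSec request field and federatedSessionToken response field are assumptions about the token endpoint’s shape:
  // Backend-only: mint a short-lived token, then build the browser-ready URL.
  async function resolveDetectionWsUrl(cameraUuid: string): Promise<string> {
    const headers = {
      "Content-Type": "application/json",
      "x-auth-scheme": "api-token",              // assumed standard auth headers
      "x-auth-apikey": process.env.RHOMBUS_API_KEY!,
    };
    const base = "https://api2.rhombussystems.com"; // assumed API base

    const media = await fetch(`${base}/api/camera/getMediaUris`, {
      method: "POST", headers, body: JSON.stringify({ cameraUuid }),
    }).then((r) => r.json());

    // Request/response field names here are assumptions; check the API reference.
    const token = await fetch(`${base}/api/org/generateFederatedSessionToken`, {
      method: "POST", headers, body: JSON.stringify({ durationSec: 3600 }),
    }).then((r) => r.json());

    // WAN mode: the token travels as query parameters (see Authenticate above).
    return wanUrlWithAuth(media.wanLiveH264Uri, token.federatedSessionToken);
  }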

From-scratch parser reference

For non-React clients (a vanilla web page, Node, Electron), the same parser drives a minimal overlay. Decoding the H.264 itself requires WebCodecs (browser) or ffmpeg/libav (Node) and is out of scope, but reading detections from the WebSocket needs only the parser above:
<canvas id="overlay" width="1920" height="1080"
  style="position:absolute; inset:0; pointer-events:none;"></canvas>

<script type="module">
  import { parseEncapHeader } from "./parseEncapHeader.js";
  import { drawDetections } from "./drawDetections.js";

  // Resolve detectionWsUrl from your backend after calling
  // /api/camera/getMediaUris and appending the federated token.
  const detectionWsUrl = "wss://…"; // placeholder: substitute the resolved URL
  const ws = new WebSocket(detectionWsUrl);
  ws.binaryType = "arraybuffer";

  const canvas = document.getElementById("overlay");
  const ctx = canvas.getContext("2d");
  let lastDetections = [];
  let lastTs = 0;
  const TTL_MS = 2000;

  ws.onmessage = (event) => {
    if (typeof event.data === "string") return; // skip init handshake
    const { detectionJson } = parseEncapHeader(event.data);
    if (!detectionJson) return;
    lastDetections = JSON.parse(detectionJson);
    lastTs = Date.now();
  };

  function tick() {
    const fresh = Date.now() - lastTs < TTL_MS;
    drawDetections(ctx, canvas.width, canvas.height, fresh ? lastDetections : []);
    requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
</script>

HTTP streams vs WebSocket

The HTTP video/h264 stream variant strips the encapsulation header entirely and delivers only raw H.264 NAL data. Detections ride only on the WebSocket transport. Use the WebSocket URLs from getMediaUris (lanLiveH264Uris / wanLiveH264Uri) for any flow that needs detections.

Troubleshooting

Boxes appear in the wrong location
Bbox coordinates are permyriad (0–10000), not pixels and not 0–1. Make sure the renderer divides by 10000 before multiplying by the canvas dimensions.

Boxes appear to lag the video
Align on det.ts, not the carrier frame’s TIMESTAMP. The detection rides whatever frame leaves the encoder next, which can trail the analyzed frame by ~250 ms.

Boxes vanish for a few hundred milliseconds, then reappear
The AI pipeline produces results at 2–10 fps and detections do not ride every video frame. Persist the most-recent detection set with a TTL (e.g. 2 s) so the overlay stays stable between updates.

Receiver only ever gets binary frames; never sees the init message
Confirm your WebSocket handler accepts text frames before binary frames. The init is a single text message sent once per connection.

LAN cookie auth fails locally
The RFT cookie must be scoped to the camera’s domain. If your app is served from localhost and the camera lives on a different LAN host, the cookie can’t be set by the browser — connect via WAN instead, or proxy through your backend.

Next Steps

React SDK

Drop-in RhombusRealtimePlayer and RhombusBufferedPlayer components.

Streaming Video

HLS, shared streams, thumbnails, and frame capture.