Overview
The LAN agent’s WebSocket video stream embeds AI detection results directly into the H.264 encapsulation header. When the on-camera inference pipeline produces a new detection, it is spliced as a TLV field into the next outgoing video frame on the same WebSocket — no separate detection channel. This guide covers:

- The TLV encapsulation format and the JSON schema inside the `AI_DETECTIONS` field
- Connecting to the live LAN H.264 WebSocket and reading both binary frames and the text init message
- Drawing detection boxes on top of `RhombusRealtimePlayer` using a parallel detection-only WebSocket
- A from-scratch parser reference for non-React consumers
If you only need a player, embed `RhombusRealtimePlayer` — it handles auth, WebCodecs decoding, and resolution negotiation. This guide is for adding a detection-overlay layer on top, or for clients that don’t use the React SDK.
Connecting to the LAN realtime stream
Get the WebSocket URL
Call `POST /api/camera/getMediaUris` and read:

- `lanLiveH264Uris` (array of strings) — LAN URLs, used when the client and camera share a network
- `wanLiveH264Uri` (string) — WAN URL, routed through Rhombus
Note: the endpoint variants differ by `/ws` vs `/wsl` in the path.
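A minimal backend-side sketch of this step. The base URL, auth header names, and request-body field name are assumptions — verify them against your API reference. The API key stays on the server; only the resulting URL reaches the browser.

```typescript
// Backend-only sketch (Node 18+, global fetch). Header names, base URL, and
// body field are assumptions — check the Rhombus API reference.
interface MediaUris {
  lanLiveH264Uris?: string[];
  wanLiveH264Uri?: string;
}

async function fetchMediaUris(cameraUuid: string, apiKey: string): Promise<MediaUris> {
  const res = await fetch("https://api2.rhombussystems.com/api/camera/getMediaUris", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-auth-scheme": "api-token", // assumed header names
      "x-auth-apikey": apiKey,
    },
    body: JSON.stringify({ cameraUuid }), // assumed body field name
  });
  if (!res.ok) throw new Error(`getMediaUris failed: ${res.status}`);
  return (await res.json()) as MediaUris;
}

// Prefer a LAN URL when the client shares a network with the camera;
// otherwise fall back to the WAN URL.
function pickStreamUrl(uris: MediaUris, preferLan: boolean): string | undefined {
  if (preferLan && uris.lanLiveH264Uris?.length) return uris.lanLiveH264Uris[0];
  return uris.wanLiveH264Uri;
}
```

`pickStreamUrl` is pure so the LAN/WAN selection can be unit-tested without a camera.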
Authenticate
Both modes use a federated session token minted on your backend via `POST /api/org/generateFederatedSessionToken`. Never put your API key in browser code.
| Mode | Auth method |
|---|---|
| WAN | Append `?x-auth-scheme=federated-token&x-auth-ft=<TOKEN>` to the URL before opening the WebSocket. |
| LAN | Set an `RFT=<TOKEN>` cookie scoped to the camera’s domain before opening the WebSocket. |
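For the WAN path, the query-parameter names come straight from the table above; the rest of this browser-side sketch (function names, callback shapes) is illustrative.

```typescript
// Append the WAN auth parameters documented above. Token minting happens on
// your backend via generateFederatedSessionToken.
function withFederatedToken(wsUrl: string, token: string): string {
  const sep = wsUrl.includes("?") ? "&" : "?";
  return `${wsUrl}${sep}x-auth-scheme=federated-token&x-auth-ft=${encodeURIComponent(token)}`;
}

// Open the stream and split the single text init message from binary frames.
function openStream(
  wsUrl: string,
  token: string,
  onInit: (info: string) => void,
  onFrame: (buf: ArrayBuffer) => void,
): WebSocket {
  const ws = new WebSocket(withFederatedToken(wsUrl, token));
  ws.binaryType = "arraybuffer"; // avoid Blob round-trips per message
  ws.onmessage = (ev) => {
    if (typeof ev.data === "string") onInit(ev.data); // text init, sent once
    else onFrame(ev.data as ArrayBuffer);             // TLV-encapsulated H.264
  };
  return ws;
}
```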
What the server sends
Immediately after the WebSocket upgrade and before any binary frames, the server sends a single text message describing the stream.
Encapsulation header (TLV format)
Each binary message contains a sequence of TLVs, each using the same type-length-value wire format.
TLV types
| Type | Name | Value | Notes |
|---|---|---|---|
| `0x00` | `SPS_PPS_IFRAME` | H.264 NAL data | Keyframe (SPS/PPS/I-frame). Always the last TLV in the message. |
| `0x01` | `NON_IFRAME` | H.264 NAL data | Delta frame (P/B). Always the last TLV in the message. |
| `0x02` | `TIMESTAMP` | 8-byte uint64 BE | Server wall-clock time in milliseconds. |
| `0x03` | `PTS_US` | 8-byte uint64 BE | Third-party PTS in microseconds. Optional; used for B-frame reordering. |
| `0x04` | `AI_DETECTIONS` | UTF-8 JSON string | New AI detections. Present only when a new inference result is available. Not null-terminated — use the length field. |
Wire layout
The frame-data TLV (type `0x00` or `0x01`) is always the last entry — the LAN agent’s encoder explicitly inserts metadata TLVs ahead of the frame entry. A safe parser stops walking TLVs once it encounters a frame-data type.
Parsing the encapsulation header
Walk TLV fields until you hit type `0x00` or `0x01` (the frame-data entry).
See `parseRhombusH264Binary.ts` for the canonical client-side reference.
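The sketch below implements the walk described above, but the exact field widths are an assumption (1-byte type, 4-byte big-endian length, then the value) — confirm them against `parseRhombusH264Binary.ts` before relying on it.

```typescript
// TLV walker under an ASSUMED wire layout: 1-byte type + 4-byte BE length
// + value. Verify field widths against parseRhombusH264Binary.ts.
const SPS_PPS_IFRAME = 0x00, NON_IFRAME = 0x01, TIMESTAMP = 0x02,
      PTS_US = 0x03, AI_DETECTIONS = 0x04;

interface ParsedMessage {
  frameType?: number;       // 0x00 or 0x01
  frame?: Uint8Array;       // raw H.264 NAL data
  timestampMs?: number;     // server wall clock, ms
  ptsUs?: bigint;           // optional third-party PTS, µs
  detectionsJson?: string;  // UTF-8 JSON, present only on new inference
}

function parseMessage(buf: ArrayBuffer): ParsedMessage {
  const view = new DataView(buf);
  const out: ParsedMessage = {};
  let off = 0;
  while (off + 5 <= view.byteLength) {
    const type = view.getUint8(off);
    const len = view.getUint32(off + 1); // big-endian by default
    const value = new Uint8Array(buf, off + 5, len);
    off += 5 + len;
    switch (type) {
      case TIMESTAMP:
        out.timestampMs = Number(new DataView(value.buffer, value.byteOffset).getBigUint64(0));
        break;
      case PTS_US:
        out.ptsUs = new DataView(value.buffer, value.byteOffset).getBigUint64(0);
        break;
      case AI_DETECTIONS:
        // Length-delimited, not NUL-terminated — decode exactly `len` bytes.
        out.detectionsJson = new TextDecoder().decode(value);
        break;
      case SPS_PPS_IFRAME:
      case NON_IFRAME:
        out.frameType = type;
        out.frame = value;
        return out; // frame data is always the last TLV — stop walking here
    }
  }
  return out;
}
```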
Detection JSON schema
`AI_DETECTIONS` carries a JSON array of detection objects. All detections from a single inference share the same `ts`.
Required fields
| Field | Type | Units | Description |
|---|---|---|---|
| `t` | int | enum | Detection type: `0` Human, `1` Vehicle, `2` Face, `3` License Plate (LPR), `4` Pose, `5` CLIP Embedding. |
| `c` | int | permyriad (0–10000) | Confidence. Divide by 100 for percent. |
| `id` | int | — | Tracker object id. Stable across frames for the same tracked object. |
| `b` | int[4] | permyriad | Bounding box `[left, top, right, bottom]`. |
| `ts` | int | ms epoch | Timestamp of the frame the AI pipeline analyzed. Use this for frame-accurate alignment. |
| `uuid` | string | RUUID | Parent event UUID. |
| `rs` | float | seconds | Relative-second timestamp within the event. |
Optional fields
| Field | Type | Notes |
|---|---|---|
| `clr` | object | Color histogram. Keys are color names (e.g. `"red"`, `"blue"`); values are permyriad. |
| `tight_crop_xxyy` | int[4] | Tight bbox within the detection’s crop window: `[x_min, x_max, y_min, y_max]` (permyriad). Useful when the consumer wants a tighter box than the padded detection window. |
| `ec` | int | Embedding confidence (permyriad), present when an embedding is computed. |
| `et` | string | Embedding type identifier. |
| `e` | string | Embedding vector (string-encoded; length depends on type). |
| `il` | string | Image-locator reference for the detection’s crop. |
Example
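A hypothetical payload, constructed purely from the field tables above — every value here is illustrative, not captured from a real camera:

```typescript
// Illustrative detection array; shapes follow the schema tables above,
// all values are made up.
const detections = [
  {
    t: 0,                            // Human
    c: 9120,                         // 91.2% confidence (permyriad)
    id: 42,                          // tracker id, stable across frames
    b: [1200, 3400, 4500, 8900],     // [left, top, right, bottom], permyriad
    ts: 1718900000123,               // ms-epoch timestamp of the analyzed frame
    uuid: "EXAMPLE-EVENT-RUUID",     // parent event UUID (placeholder)
    rs: 3.5,                         // seconds into the event
    clr: { red: 6200, blue: 1800 },  // optional color histogram (permyriad)
  },
];
```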
Forward-compatible parsing. Future firmware releases will add LPR text (`lp_chars`, `lp_confidence`), pose skeletons (`pose_permyriad_points` — 38-joint, not the 17-joint COCO set), and re-identification embeddings. Treat all unrecognized fields as optional and ignore unknown keys, so your client keeps working when those fields land.
Drawing bounding boxes on a canvas
Bounding box coordinates are permyriad (0–10000) and resolution-independent. Convert to pixels using the canvas dimensions. For a 1280×720 canvas and `b: [1200, 3400, 4500, 8900]`, this yields (x=153.6, y=244.8, w=422.4, h=396.0).
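The conversion can be sketched as below; the pure math is split from the canvas call so it can be tested, and the worked numbers above fall out of `toPixels` directly.

```typescript
// Permyriad [left, top, right, bottom] -> pixel rect for a given canvas size.
function toPixels(b: [number, number, number, number], cw: number, ch: number) {
  const [l, t, r, btm] = b;
  return {
    x: (l / 10000) * cw,
    y: (t / 10000) * ch,
    w: ((r - l) / 10000) * cw,
    h: ((btm - t) / 10000) * ch,
  };
}

// Redraw all boxes for the current detection set on an overlay canvas.
function drawBoxes(
  ctx: CanvasRenderingContext2D,
  dets: { b: [number, number, number, number] }[],
): void {
  const { width, height } = ctx.canvas;
  ctx.clearRect(0, 0, width, height);
  ctx.strokeStyle = "lime";
  ctx.lineWidth = 2;
  for (const d of dets) {
    const r = toPixels(d.b, width, height);
    ctx.strokeRect(r.x, r.y, r.w, r.h);
  }
}
```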
Timing behavior
- Detections are not present on every frame. The AI pipeline analyzes a subset of frames (typically 2–10 fps), so most frames carry no `AI_DETECTIONS` TLV.
- `det.ts` may precede the carrier frame’s `TIMESTAMP` by up to ~250 ms because the inference pipeline and the encoder run independently — the new detection rides whatever frame happens to leave the encoder next. Align overlays on `det.ts`, not on the enclosing frame’s `TIMESTAMP`, especially for VOD or buffered playback.
- Persist between updates. To keep boxes visible between detection updates, hold the most recent set and keep redrawing it until a newer set arrives or a TTL elapses. A 2-second TTL is a safe default.
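The persist-with-TTL behavior can be captured in a small holder, sketched here with an injectable clock (a convenience for testing, not part of any SDK API):

```typescript
// Holds the most recent detection set; get() returns it until the TTL
// elapses, then returns an empty set so stale boxes disappear.
class DetectionHold<T> {
  private dets: T[] = [];
  private updatedAt = 0;
  constructor(
    private ttlMs = 2000,                     // 2 s is the guide's safe default
    private now: () => number = Date.now,     // injectable clock for tests
  ) {}
  update(dets: T[]): void {
    this.dets = dets;
    this.updatedAt = this.now();
  }
  get(): T[] {
    return this.now() - this.updatedAt <= this.ttlMs ? this.dets : [];
  }
}
```

A render loop calls `hold.get()` every animation frame and redraws whatever it returns, so boxes stay stable between the 2–10 fps inference updates.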
Extending RhombusRealtimePlayer with detection rendering
The React SDK’s `RhombusRealtimePlayer` doesn’t currently surface AI detections to the host application. Until it does, the simplest pattern is to open a second WebSocket to the same URL purely to read `AI_DETECTIONS`, and draw the result on a `<canvas>` overlaid on the player.
Build `detectionWsUrl` on your backend the same way the SDK does: call `getMediaUris`, pick the appropriate `wanLiveH264Uri` or LAN entry, then append `?x-auth-scheme=federated-token&x-auth-ft=<TOKEN>` (WAN) or set the `RFT` cookie (LAN) before passing the URL to the browser.
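The wiring for this second, detection-only socket can be sketched as follows. The message handler is factored out as a pure function; `extractDetectionsJson` stands in for whatever TLV parser you use (a hypothetical signature, not an SDK export):

```typescript
type DrawFn = (dets: unknown[]) => void;

// Pure, testable handler: returns true when a detection set was drawn.
function handleStreamMessage(
  data: string | ArrayBuffer,
  extractDetectionsJson: (buf: ArrayBuffer) => string | undefined,
  draw: DrawFn,
): boolean {
  if (typeof data === "string") return false; // the text init message — ignore
  const json = extractDetectionsJson(data);
  if (!json) return false;                    // most frames carry no detections
  draw(JSON.parse(json));
  return true;
}

// Browser wiring: a read-only socket alongside RhombusRealtimePlayer.
// Returns a detach function for cleanup (e.g. a React effect teardown).
function attachDetectionOverlay(
  detectionWsUrl: string,
  extract: (buf: ArrayBuffer) => string | undefined,
  draw: DrawFn,
): () => void {
  const ws = new WebSocket(detectionWsUrl);
  ws.binaryType = "arraybuffer";
  ws.onmessage = (ev) => handleStreamMessage(ev.data, extract, draw);
  return () => ws.close();
}
```

Because the overlay socket never decodes video, it stays cheap even at full stream bitrate; only the `AI_DETECTIONS` TLVs are acted on.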
From-scratch parser reference
For non-React clients (a vanilla web page, Node, Electron), the same parser drives a minimal overlay. Decoding the H.264 itself requires WebCodecs (browser) or ffmpeg/libav (Node) and is out of scope, but reading detections from the WebSocket needs only the parser above.
HTTP streams vs WebSocket
The HTTP `video/h264` stream variant strips the encapsulation header entirely and delivers only raw H.264 NAL data. Detections ride only on the WebSocket transport. Use the WebSocket URLs from `getMediaUris` (`lanLiveH264Uris` / `wanLiveH264Uri`) for any flow that needs detections.
Troubleshooting
Boxes appear in the wrong location
Bbox coordinates are permyriad (0–10000), not pixels and not 0–1. Make sure the renderer divides by 10000 before multiplying by the canvas dimensions.
Boxes appear to lag the video
Align on `det.ts`, not the carrier frame’s `TIMESTAMP`. The detection rides whatever frame leaves the encoder next, which can trail the analyzed frame by ~250 ms.
Boxes vanish for a few hundred milliseconds, then reappear
The AI pipeline produces results at 2–10 fps and detections do not ride every video frame. Persist the most-recent detection set with a TTL (e.g. 2 s) so the overlay stays stable between updates.
Receiver only ever gets binary frames; never sees the init message
Confirm your WebSocket handler accepts text frames before binary frames. The init is a single text message sent once per connection.
LAN cookie auth fails locally
The `RFT` cookie must be scoped to the camera’s domain. If your app is served from `localhost` and the camera lives on a different LAN host, the cookie can’t be set by the browser — connect via WAN instead, or proxy through your backend.
Next Steps
React SDK
Drop-in `RhombusRealtimePlayer` and `RhombusBufferedPlayer` components.
Streaming Video
HLS, shared streams, thumbnails, and frame capture.