Synthetic data glossary

At SKY ENGINE AI, we believe synthetic data is a tool, and good tools should be transparent. That’s why we created a simple glossary that breaks down the core terms in the synthetic data world. Bookmark it, and next time the jargon shows up, you’ll have the answer right there.

A

Annotation

Annotations are marks or metadata added to data to identify regions, objects, or other information used for model training. Examples include outlines around pedestrians, bounding boxes around vehicles, or keypoints on objects.

B

Blueprint

A blueprint is a structured template that defines the configurable parameters of a pre-built scene, such as lighting, shaders, and object placement, and allows users to adjust them.

Bounding box

A bounding box is a rectangle drawn around an object in an image to indicate its location. Bounding boxes help AI models quickly identify, locate, and classify objects.
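
As an illustration, here is a minimal Python sketch of a bounding box in the widely used [x, y, width, height] convention; the class and values are placeholders, not tied to any specific tool.

```python
from dataclasses import dataclass

# Illustrative only: a bounding box in the common [x, y, width, height] convention.
@dataclass
class BoundingBox:
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """Return True if the pixel (px, py) falls inside the box."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)

car = BoundingBox(x=120, y=80, width=64, height=40)
print(car.contains(150, 100))  # True
```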

C

Cluster

A cluster is a group of computers or GPUs that operate together to perform parallel tasks. In synthetic scene generation, clusters can simultaneously handle different components, such as rendering digital humans, configuring environments, and applying lighting and textures. Together, they generate the full scene faster than a single system.

COCO (Common Objects in Context) JSON format

The COCO JSON format is a standard for storing image annotation data, such as bounding boxes, keypoints, and semantic masks.
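
A minimal, illustrative fragment of a COCO-style file, written here as a Python dictionary; real files typically carry more fields (licenses, segmentation, keypoints, and so on).

```python
import json

# A stripped-down COCO-style annotation structure (illustrative values).
coco = {
    "images": [
        {"id": 1, "file_name": "scene_0001.png", "width": 1920, "height": 1080}
    ],
    "categories": [
        {"id": 1, "name": "vehicle", "supercategory": "object"}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120.0, 80.0, 64.0, 40.0],  # [x, y, width, height] in pixels
            "area": 64.0 * 40.0,
            "iscrowd": 0,
        }
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```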

D

Data iteration

Data iteration is a training step where the model processes a batch of input data, compares its outputs to the ground truth labels, and updates its internal parameters to reduce errors. Repeating this process across many batches allows the model to gradually learn patterns and improve its accuracy.
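
A minimal sketch of a single data iteration using PyTorch; the model, batch, and settings are placeholders chosen for illustration.

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                       # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(32, 10)                   # one batch of 32 samples
labels = torch.randint(0, 2, (32,))            # ground-truth labels for the batch

outputs = model(inputs)                        # forward pass
loss = criterion(outputs, labels)              # compare outputs to ground truth
optimizer.zero_grad()
loss.backward()                                # compute gradients
optimizer.step()                               # update parameters to reduce error
```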

Data balancing

Data (or dataset) balancing ensures that training data represents all relevant classes and scenarios. Balanced datasets prevent model bias toward frequent cases and improve performance on rare or underrepresented situations.
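
One common balancing approach is to weight the training loss inversely to class frequency, so rare classes contribute more; a small illustrative Python sketch with made-up class names and counts:

```python
from collections import Counter

# Illustrative labels only: one frequent class and two rare ones.
labels = ["car"] * 900 + ["cyclist"] * 80 + ["wheelchair_user"] * 20

counts = Counter(labels)
total = len(labels)

# Weight each class inversely to its frequency.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}
print(weights)  # rare classes such as "wheelchair_user" receive the largest weight
```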

Depth map

A depth map represents the distance of each point in a scene from the camera: brighter areas are closer, darker areas are farther away. Depth maps help AI understand the 3D structure of a scene and estimate distances between objects.
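
A small illustrative sketch of turning a metric depth map into an 8-bit image using the brighter-is-closer convention described above; the values are placeholders.

```python
import numpy as np

depth_m = np.random.uniform(0.5, 50.0, size=(480, 640))  # placeholder depth map in metres

near, far = depth_m.min(), depth_m.max()
normalized = (far - depth_m) / (far - near)        # 1.0 = closest, 0.0 = farthest
depth_image = (normalized * 255).astype(np.uint8)  # brighter pixels are closer
```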

Determinism

In data science, determinism is the characteristic of a system that produces the same output when given the same input. In synthetic data generation, this means that given the same inputs (objects, lighting, camera settings, etc.), every generated scene will be identical. Any differences then come from the model itself, not from the data, which stays the same. This consistency simplifies testing and evaluation of AI models.
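
A minimal sketch of one ingredient of determinism in practice: fixing the random seeds so repeated runs produce identical outputs (note that full determinism on GPUs may require additional framework settings).

```python
import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy's RNG
torch.manual_seed(SEED)  # PyTorch's RNG

print(np.random.rand(3))  # the same three numbers on every run
```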

Distributed rendering environment

A distributed rendering environment divides the task of generating synthetic scenes across multiple computers or GPU clusters. Each machine handles part of the work, enabling hundreds or thousands of scenes to be created simultaneously.

Distortion

Distortion, or lens distortion, simulates the optical effects of real camera lenses, where straight lines or shapes may appear curved or bent. In synthetic data generation, these effects are simulated to make scenes resemble real-world camera captures.
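
As an illustration, a common radial distortion model bends normalized image coordinates by a factor that grows with distance from the image center; the coefficients below are example values only.

```python
# Illustrative radial (Brown-Conrady style) distortion on normalized image coordinates.
def radial_distort(x: float, y: float, k1: float = -0.25, k2: float = 0.05):
    r2 = x**2 + y**2                     # squared distance from the image center
    factor = 1 + k1 * r2 + k2 * r2**2    # distortion grows away from the center
    return x * factor, y * factor

x_d, y_d = radial_distort(0.4, 0.3)
print(x_d, y_d)  # straight lines passing through such points appear slightly curved
```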

Domain adaptation

Domain adaptation is the process of bridging the gap between synthetic and real-world data so that AI models trained on synthetic data perform effectively in real environments.

Domain gap

Domain gap is the difference between the data a model is trained on and the data it encounters in the real world. It can result from variations in appearance, structure, or other characteristics of objects and scenes.

Domain inspector

A domain inspector is a tool that evaluates synthetic data for quality and consistency. It can detect issues such as missing elements, labeling errors, or unrealistic scene properties.

E

Edge case

An edge case is a rare or unusual situation that a model must still handle correctly. These cases are difficult to capture in real-world data, but synthetic data allows them to be included deliberately, helping models learn to manage uncommon scenarios.

Environmental map

An environmental map is a 360° representation of a scene’s surroundings used to create realistic lighting, reflections, and backgrounds in rendering. It allows objects in a synthetic scene to respond consistently to the lighting and reflective properties of their environment.

F

Facial landmarks

Facial landmarks are key points on a face that mark the position of important facial features. They provide reference information for AI to analyze expressions, track facial movements, or recognize faces.

Fisheye lens

A fisheye lens simulates a wide-angle optical effect that curves objects near the edges of an image while keeping the center relatively undistorted.

G

Garden (of AI models)

A garden is a collection of pre-trained AI models that are ready for use. Engineers can select models from the garden to accelerate training, experiment with new data, or benchmark performance without starting from scratch.

Gaze vector

A gaze vector is a line or arrow that represents the direction a person’s eyes are pointing, indicating where their attention or focus is directed.
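
A minimal sketch of computing a gaze vector as the unit vector from the eye center to a gaze target; the coordinates are placeholders.

```python
import numpy as np

eye_center = np.array([0.0, 1.6, 0.0])    # placeholder position in scene coordinates
gaze_target = np.array([0.5, 1.5, 2.0])   # point the person is looking at

gaze_vector = gaze_target - eye_center
gaze_vector /= np.linalg.norm(gaze_vector)  # normalize to unit length
print(gaze_vector)
```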

GPU (Graphics Processing Unit)

A GPU is a processor designed for parallel computation, enabling it to perform many calculations simultaneously. In computer vision, it is used to render 3D scenes, generate images, and accelerate the processing of large datasets.

H

Head segmentation

Head segmentation identifies all pixels corresponding to a person’s head, producing a mask that separates it from the background and other body parts. This enables AI models to analyze head-related features, such as orientation, movement, or expressions.

Hyperparameters

Hyperparameters are the configuration settings that control how an AI model is trained. Examples include the learning rate (how quickly the model updates its parameters) and the batch size (how many training examples are processed at once).
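
A small illustrative sketch, assuming PyTorch, of hyperparameters kept in one configuration object and handed to the training setup; the values are placeholders, not recommendations.

```python
import torch
from torch import nn

# Hyperparameters live in configuration, separate from the model's learned parameters.
hparams = {
    "learning_rate": 1e-3,   # how quickly the model updates its parameters
    "batch_size": 32,        # training examples processed at once
    "epochs": 20,            # full passes over the training data
}

model = nn.Linear(10, 2)     # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=hparams["learning_rate"])
```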

Hypersynthetic data

Hypersynthetic data is the next step in synthetic data generation. Unlike traditional approaches that generate features randomly and independently, it follows the real-world distribution. Each scene feature is mapped into an n-dimensional space, and samples are drawn systematically rather than randomly. The result is datasets that capture realistic variability and provide broader, more representative coverage of real-world scenarios.

I

Inference

Inference is the process by which a trained AI model processes new input data and applies the patterns and features it learned during training to generate outputs, such as classifications, detections, or tracked positions.

J

Jitter

Jitter introduces small random changes in the position, orientation, or scale of objects in a dataset. This helps AI models learn to recognize objects even when they appear slightly shifted, rotated, or resized compared to the training examples.
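
A minimal sketch of jittering an object's position, rotation, and scale with NumPy; the ranges are placeholders.

```python
import numpy as np

rng = np.random.default_rng()

position = np.array([2.0, 0.0, 5.0])
position += rng.uniform(-0.05, 0.05, size=3)   # shift by up to 5 cm per axis

rotation_deg = 90.0 + rng.uniform(-2.0, 2.0)   # rotate by up to +/- 2 degrees
scale = 1.0 * rng.uniform(0.97, 1.03)          # resize by up to +/- 3 %
```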

K

Keypoints

Keypoints are specific locations on an object that define its shape, structure, or position. They can be used in both 2D and 3D data to provide reference information for AI models to understand object geometry and spatial relationships.

L

Label

A label is a name or category assigned to an object or data point to indicate its identity or class. For example, a car might be labeled “sedan” and a pedestrian “adult.”

Library (PyTorch, TensorFlow)

A deep learning library is a collection of pre-built tools and functions for creating, training, and evaluating neural networks. Libraries allow engineers to develop AI models efficiently, process data, and generate outputs without writing all underlying algorithms from scratch.

LiDAR

LiDAR is a sensor that emits laser beams and measures the time it takes for them to return, creating a detailed 3D map of the surrounding environment.

M

Metadata

Metadata is additional information that describes the context of a scene or object rather than the object itself. It can include details such as the sensor type, lighting conditions, or weather in the scene.

Modalities

Modalities are the different types of data that a sensor or system can capture from a scene. They might include RGB (color) images, depth maps (distance), infrared or thermal data (heat), LiDAR point clouds (3D structure), radar signals (motion or distance), and others.

Multimodality

Multimodality refers to datasets that contain multiple types of sensor data simultaneously. Combining these different modalities allows AI models to gain a more comprehensive understanding of the environment.

Multispectral rendering

Multispectral rendering captures a scene in multiple types of light, such as visible, infrared, or ultraviolet, providing AI with information beyond what is visible to the human eye. This allows models to detect features or patterns that may not be apparent in a single spectrum.

Multi-GPU architecture

Multi-GPU architecture uses multiple GPUs within a single system to distribute and parallelize computational tasks.

N

NIR (Near-infrared)

NIR is a type of light beyond the visible spectrum. Surfaces that look similar in visible light may reflect NIR differently, helping AI distinguish features or materials. For example, a black seatbelt that blends into a dark jacket in visible light becomes clearly visible in NIR.

Normal map

A normal map shows the direction each point on a surface is facing, using colors to represent orientation. This helps AI understand the shape and details of objects, like bumps or slopes, independently of lighting and shadows.
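
A small illustrative sketch of decoding a normal map stored as an RGB image, assuming the common convention that each channel maps [0, 255] to a direction component in [-1, 1].

```python
import numpy as np

rgb = np.array([128, 128, 255], dtype=np.float32)  # a typical "flat surface" pixel

normal = rgb / 255.0 * 2.0 - 1.0     # map [0, 255] to [-1, 1]
normal /= np.linalg.norm(normal)     # renormalize to unit length
print(normal)                        # approximately [0, 0, 1]: the surface faces the viewer
```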

O

Occlusion labels

Occlusion labels indicate parts of objects that are partially hidden or not visible, such as a driver’s hand behind the steering wheel or a passenger’s face partially covered by a hat. These labels help AI models detect and interpret objects even when they are partially obscured.

OpenPBR (Physically-Based Rendering)

OpenPBR is an open-source shading model that defines how different materials interact with light, providing consistent and realistic appearance across surfaces in 3D scenes.

P

PBR (Physically-based rendering)

PBR is a method for generating synthetic data that simulates the physical behavior of light on surfaces, creating realistic reflections, refractions, and shading. This produces more realistic scenes compared to simpler methods that rely on statistical approximations or procedural rules.

Physical AI

Physical AI refers to AI systems integrated with physical devices, giving them the ability to perceive and interact with the real world. This includes applications in robotics, drones, autonomous vehicles, and other systems that perform tasks in physical environments.

Pinhole lens

A pinhole lens captures a scene through a tiny aperture. It’s a simple, fast way to model a camera in computer vision, ignoring optical effects like distortion.
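
A minimal sketch of the pinhole projection of a 3D point onto the image plane; the focal lengths and principal point are example values.

```python
fx, fy = 800.0, 800.0      # focal lengths in pixels
cx, cy = 320.0, 240.0      # principal point (image center) in pixels

X, Y, Z = 0.5, -0.2, 4.0   # a point in camera coordinates, 4 metres in front

u = fx * X / Z + cx        # horizontal pixel coordinate
v = fy * Y / Z + cy        # vertical pixel coordinate
print(u, v)
```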

Pose estimation

Pose estimation identifies key points on a person or object to represent their position and movement. AI uses these points to understand posture, orientation, and motion within a scene.

Post-processing

Post-processing refers to adjustments applied after a scene is rendered in a synthetic data engine. These adjustments can simulate sensor characteristics, optical effects, or environmental conditions, and can be configured for different sensor types or scenarios.

Procedural generation

Procedural generation uses algorithms and rules to automatically generate scenes, producing new layouts each time without manual placement of objects.

Q

Quantum efficiency (QE)

Quantum efficiency measures the proportion of incoming light that a sensor captures. A higher QE means the sensor can capture more detail in low-light conditions.
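
Put simply, QE is the ratio of signal produced to light received; a tiny illustrative calculation with made-up numbers:

```python
photons_incident = 10_000      # photons hitting the sensor (placeholder)
electrons_generated = 7_200    # photons converted into signal (placeholder)

qe = electrons_generated / photons_incident
print(f"QE = {qe:.0%}")        # QE = 72%
```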

R

Radar

Radar is a sensor that uses radio waves to measure the distance and motion of objects. By detecting the reflected signals, it provides information about object position, speed, and direction.

Randomization

Randomization introduces variations in key elements of a scene, such as object placement, lighting, camera angles, or textures, each time it is generated. This provides AI models with diverse examples, helping them generalize rather than memorize specific setups.

Ray tracing

Ray tracing is a rendering technique that simulates realistic light behavior, including reflections, refractions, and shadows, to produce physically accurate images of a scene.

Rendering

Rendering is the process of turning a digital scene into an image or video. It simulates light, shadows, and surface details to create a realistic representation of a scene.

Rendering shaders

Rendering shaders are procedures that define how light interacts with materials in a scene. For example, a PBR shader makes a car’s metal gleam, a skin shader shows light scattering on a pedestrian’s face, and a glass shader makes windows shine and reflect.

S

Segmentation

Segmentation is the process of dividing a scene into distinct parts or objects, assigning each a unique identifier or mask.

Semantic mask

A semantic mask is the result of semantic segmentation, where each pixel in an image is assigned a label corresponding to the object or class it belongs to.
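
A minimal NumPy sketch of a semantic mask as a 2D array of class IDs, one per pixel; the classes and regions are placeholders.

```python
import numpy as np

CLASSES = {0: "background", 1: "road", 2: "vehicle", 3: "pedestrian"}

mask = np.zeros((480, 640), dtype=np.uint8)  # every pixel starts as background
mask[300:480, :] = 1                         # lower part of the image is road
mask[320:360, 200:280] = 2                   # a vehicle region

vehicle_pixels = (mask == 2).sum()
print(vehicle_pixels)                        # number of pixels labeled as vehicle
```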

Sensor fusion

Sensor fusion combines data from multiple simulated sensors into a unified representation. This allows AI models to perceive and interpret the scene more accurately and reliably than relying on a single sensor type.

Sensor simulation

Sensor simulation replicates how real-world sensors capture data, including their limitations, distortions, and noise. With sensor simulation, AI models learn to interpret sensor outputs accurately and handle imperfections they will encounter in real-world environments.

Synthetic dataset

A synthetic dataset is a collection of computer-generated data, including images, videos, and annotations, produced by a synthetic data engine for training AI models.

T

Texture

A texture is the visual information applied to a 3D object to define its color, patterns, and visual characteristics. It helps synthetic objects appear realistic by representing fine details of their surfaces.

U

UV layout

A UV layout is a 2D representation of a 3D object’s surface that defines how textures are mapped onto the model. The “U” and “V” axes correspond to the horizontal and vertical dimensions of this map.

V

Validation

Validation evaluates how well a trained AI model performs on new, unseen data. It measures the model’s ability to detect objects, recognize patterns, and interpret scenes accurately, highlighting areas where the model may make errors or misclassifications.
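
A minimal sketch of a validation pass, assuming PyTorch; the model and held-out data are placeholders.

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                    # placeholder model
val_inputs = torch.randn(100, 10)           # held-out data the model has not seen
val_labels = torch.randint(0, 2, (100,))

model.eval()                                # switch to evaluation mode
with torch.no_grad():                       # no gradients: parameters are not updated
    predictions = model(val_inputs).argmax(dim=1)
    accuracy = (predictions == val_labels).float().mean().item()

print(f"validation accuracy: {accuracy:.2%}")
```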

X

X-CAT (Extended Cardiac-Torso)

X-CAT is a 4D digital human model that represents the heart, torso, and other organs in 3D over time, simulating physiological movements such as heartbeats and respiration.