LLM Simulator
The LLM Simulator is a comprehensive performance modeling and simulation tool designed for analyzing Large Language Model (LLM) inference workloads. This tool provides detailed insights into computational performance, memory usage, and system behavior under various hardware configurations and request patterns.
The simulator consists of several key components that work together to model the complete lifecycle of LLM inference requests:
- Request Modeling: Simulates incoming generation requests with configurable arrival patterns
- Engine Simulation: Models LLM inference engines with prefill and decode phases
- Performance Analysis: Provides roofline analysis and hardware-specific performance metrics
- Memory Management: Tracks memory allocation and deallocation across multiple concurrent requests
- Trace Generation: Outputs detailed execution traces for visualization and analysis
Install
    git clone git@github.com:eth-easl/Scratchpad.git
    cd tools/simulator
Architecture
Core Components
Request Module (core/request.py)
The GenerationRequest class represents individual inference requests with the following attributes:
- req_id: Unique identifier for the request
- model: Target model name (e.g., "meta-llama/Llama-2-7b-hf")
- input_length: Number of input tokens
- output_length: Number of output tokens to generate
- arrive_at: Timestamp when the request arrives
- status: Current state (PENDING, SCHEDULED, PREFILL, GENERATE, EXIT)
Requests progress through different states:
- PENDING: Waiting to be processed
- PREFILL: Initial input processing phase
- GENERATE: Sequential token generation phase
- EXIT: Request completed
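As a minimal sketch of the request object described above, the following dataclass mirrors the listed attributes and states. The enum name RequestStatus, the generated_tokens counter, and the step() helper are hypothetical; the actual class in core/request.py may be organized differently.

```python
from dataclasses import dataclass
from enum import Enum


class RequestStatus(Enum):  # hypothetical name; states taken from the list above
    PENDING = "PENDING"
    SCHEDULED = "SCHEDULED"
    PREFILL = "PREFILL"
    GENERATE = "GENERATE"
    EXIT = "EXIT"


@dataclass
class GenerationRequest:
    req_id: str
    model: str
    input_length: int
    output_length: int
    arrive_at: float
    status: RequestStatus = RequestStatus.PENDING
    generated_tokens: int = 0  # assumption: progress counter, not in the docs

    def step(self) -> None:
        """Hypothetical helper: record one decoded token, exit when done."""
        self.status = RequestStatus.GENERATE
        self.generated_tokens += 1
        if self.generated_tokens >= self.output_length:
            self.status = RequestStatus.EXIT


req = GenerationRequest("r0", "meta-llama/Llama-2-7b-hf", 23, 2, arrive_at=0.0)
req.step()
req.step()
print(req.status.name)
```

A request with output_length=2 reaches EXIT after two decode steps in this sketch.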
Engine Module (core/engine.py)
The LLMEngine class simulates the actual inference execution:
Key Features:
- Manages request queues (waiting, running, finished, failed)
- Handles memory allocation through a memory planner
- Executes prefill and decode phases with timing derived from the analytical performance model
- Generates trace events for performance analysis
Execution Phases:
- Prefill Phase: Processes all input tokens at once, computes key-value caches
- Decode Phase: Generates output tokens one at a time, using cached keys and values
Trace Module (core/trace.py)
The TraceEvent dataclass captures execution events for performance analysis:
- Events include prefill, decode, and memory operations
- Timestamps are recorded in microseconds for Chrome Trace Format compatibility
- Supports both duration events (ph="X") and counter events (ph="C")
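To illustrate what such events might look like, here is a small sketch of a trace event dataclass serialized in Chrome Trace Format. The field set follows the public Chrome Trace Event Format (name, ph, ts, dur, pid, tid, args); whether core/trace.py uses exactly these fields, and the to_chrome_trace helper, are assumptions.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TraceEvent:  # field set assumed from the Chrome Trace Event Format
    name: str
    ph: str            # "X" = complete (duration) event, "C" = counter event
    ts: float          # start timestamp in microseconds
    pid: int = 0       # assumption: used to group events per engine
    tid: int = 0
    dur: float = 0.0   # duration in microseconds, meaningful for ph="X"
    args: dict = field(default_factory=dict)


def to_chrome_trace(events):
    """Hypothetical helper: serialize events as a Chrome Trace JSON object."""
    return json.dumps({"traceEvents": [asdict(e) for e in events]})


events = [
    TraceEvent(name="prefill", ph="X", ts=0.0, dur=1500.0, args={"req_id": "r0"}),
    TraceEvent(name="memory", ph="C", ts=1500.0, args={"used_blocks": 12}),
]
trace_json = to_chrome_trace(events)
```

For counter events, the trace viewer plots each key in args as a series over time, which is how a memory usage counter can appear as a graph.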
Hardware Module (config/hardware_params.py)
Contains detailed specifications for various GPU hardware platforms:
Supported Hardware:
- NVIDIA A100 (40GB/80GB variants)
- NVIDIA H100 (SXM/PCIe variants)
- NVIDIA A40
- NVIDIA L40
Parameters per Hardware:
- Memory bandwidth (bytes/second)
- Peak compute performance (FLOPS) for different precision levels
- On-chip buffer size
Running a Complete Simulation
The main simulation entry point is cli/start_simulator.py:
    python cli/start_simulator.py \
      --input input/trace_file.json \
      --n-engines 4 \
      --arrival-rate 2.0 \
      --trace-output output/trace.json \
      --stats-output output/stats.json
Parameters:
- --input: File containing request traces (one JSON object per line, carrying the input and output token counts)
- --n-engines: Number of LLM engines to simulate
- --arrival-rate: Request arrival rate (requests per second)
- --trace-output: Output file for Chrome Trace Format events
- --stats-output: Output file for simulation statistics
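The documentation does not say how the arrival rate is turned into per-request timestamps. One common modeling choice, shown here purely as an assumption, is a Poisson process with exponentially distributed gaps between arrivals; poisson_arrivals is a hypothetical helper, not part of the simulator.

```python
import random


def poisson_arrivals(n_requests, arrival_rate, seed=0):
    """Return n_requests arrival timestamps (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate)  # mean gap = 1 / arrival_rate
        arrivals.append(t)
    return arrivals


arrivals = poisson_arrivals(1000, arrival_rate=2.0)
mean_gap = arrivals[-1] / len(arrivals)  # close to 0.5 s at 2 requests/second
```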
Input Trace Format
The input trace file should contain one JSON object per line:
{"id": "a-1234", "status": "DEFAULT", "created_at": "2024-07-17 13:56:49.399", "finished_at": "2024-07-17 13:56:50.527", "model": "meta-llama/Meta-Llama-3-8B-Instruct", "model_parameters": {"top_p": 1, "max_tokens": 256, "temperature": 0, "presence_penalty": 0, "frequency_penalty": 0}, "reported_token_input": 23, "reported_token_output": 38}
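A minimal sketch of reading such a file follows. It assumes the token counts come from the reported_token_input and reported_token_output fields shown in the example record; which fields the simulator actually reads is not stated here, and parse_trace_lines is a hypothetical helper.

```python
import json

# Shortened copy of the example record above, keeping only the fields used here.
SAMPLE = (
    '{"id": "a-1234", "model": "meta-llama/Meta-Llama-3-8B-Instruct", '
    '"reported_token_input": 23, "reported_token_output": 38}\n'
)


def parse_trace_lines(lines):
    """Yield (id, model, input_tokens, output_tokens) per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        yield (
            rec["id"],
            rec["model"],
            int(rec["reported_token_input"]),
            int(rec["reported_token_output"]),
        )


records = list(parse_trace_lines(SAMPLE.splitlines()))
print(records)
```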
Output Files
- Trace Output (trace.json): Chrome Trace Format file containing:
  - Prefill and decode events with timing information
  - Memory usage counters
  - Request lifecycle events
- Stats Output (stats.json): Summary statistics including:
  - Request completion times
  - System utilization metrics
  - Failed request information
  - Configuration details
Reading the output trace
The generated trace can be viewed in Chrome's trace viewer:
- Open Chrome and navigate to chrome://tracing.
- Load the trace.json file to visualize request timelines, engine utilization, and memory usage.
Utility Functions
Performance Calculations (utils.py)
Key functions for performance modeling:
- flops_matmul(b, m, n, k, rank=None): Calculate the FLOPs (floating-point operation count) of a matrix multiplication
- memory_matmul(b, m, n, k, w_bit, a_bit, rank=None): Calculate the memory traffic of a matrix multiplication for the given weight and activation bit widths
- roofline_analyze(bandwidth, max_OPS, OPs, memory_access): Roofline performance analysis
- get_linear_layers(...): Extract linear layer dimensions from model configuration
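The roofline idea behind these helpers can be sketched as follows. The functions below use their own names (matmul_flops, matmul_bytes, roofline) because they are illustrative reimplementations under stated assumptions, not the code in utils.py: the weight matrix is assumed to be shared across the batch, and each activation is assumed to be read or written once. The hardware numbers are A100 40GB datasheet values used only as an example.

```python
def matmul_flops(b, m, n, k):
    """FLOPs of b independent (m x k) @ (k x n) products (multiply + add)."""
    return 2 * b * m * n * k


def matmul_bytes(b, m, n, k, w_bit=16, a_bit=16):
    """Bytes moved: one shared (k x n) weight plus input/output activations."""
    weight = k * n * w_bit / 8
    acts = b * (m * k + m * n) * a_bit / 8
    return weight + acts


def roofline(bandwidth, max_ops, ops, memory_access):
    """Return (arithmetic_intensity, attainable_ops_per_s, bound)."""
    intensity = ops / memory_access
    ridge = max_ops / bandwidth  # intensity where the two roofs meet
    if intensity < ridge:
        return intensity, intensity * bandwidth, "memory"
    return intensity, max_ops, "compute"


BW, PEAK = 1555e9, 312e12  # example: A100 40GB bandwidth and dense FP16 peak

# Decode: one token through a 4096 x 4096 projection.
decode = roofline(BW, PEAK, matmul_flops(1, 1, 4096, 4096),
                  matmul_bytes(1, 1, 4096, 4096))
# Prefill: 2048 tokens through the same projection.
prefill = roofline(BW, PEAK, matmul_flops(1, 2048, 4096, 4096),
                   matmul_bytes(1, 2048, 4096, 4096))
print(decode[2], prefill[2])
```

Under these assumptions the single-token decode step is memory-bound while the 2048-token prefill is compute-bound, and the estimated time for either is ops divided by the attainable rate.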
Model Layer Analysis
The simulator automatically extracts linear layer dimensions from transformer models:
- Query, Key, Value projections
- Output projection
- Feed-forward gate, up, and down projections
- Supports tensor parallelism with TP size > 1
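As a rough illustration of the projections listed above, the sketch below derives per-rank linear layer shapes from a Llama-style configuration. The function name, the returned layout, and the Megatron-style split (column-parallel for Q/K/V/gate/up, row-parallel for output/down) are assumptions; get_linear_layers may return a different structure.

```python
def linear_layer_shapes(hidden, n_heads, n_kv_heads, intermediate, tp_size=1):
    """Return {layer_name: (in_features, out_features)} for one TP rank."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    return {
        # Column-parallel: output features are split across ranks.
        "q_proj": (hidden, hidden // tp_size),
        "k_proj": (hidden, kv_dim // tp_size),
        "v_proj": (hidden, kv_dim // tp_size),
        "gate_proj": (hidden, intermediate // tp_size),
        "up_proj": (hidden, intermediate // tp_size),
        # Row-parallel: input features are split across ranks.
        "out_proj": (hidden // tp_size, hidden),
        "down_proj": (intermediate // tp_size, hidden),
    }


# Llama-3-8B-style config: hidden=4096, 32 heads, 8 KV heads, FFN=14336.
shapes = linear_layer_shapes(4096, 32, 8, 14336, tp_size=2)
print(shapes["k_proj"], shapes["down_proj"])
```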
Request Processing Flow
- Request Arrival: Requests are added to the waiting queue
- Memory Check: System verifies if sufficient memory is available
- Prefill Execution: Input tokens are processed, KV cache is built
- Decode Loop: Output tokens are generated sequentially
- Memory Cleanup: Memory is freed when requests complete
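The flow above can be sketched as a small FIFO loop with memory-based admission. This is a simplified illustration, not the engine's implementation: it assumes KV blocks for the full input plus output length are reserved at admission, that every running request emits one token per iteration, and that a request larger than total memory is marked failed. The function name and block size are hypothetical.

```python
from collections import deque


def simulate_fifo(requests, total_blocks, block_size=16):
    """requests: list of (req_id, input_len, output_len).

    Returns (finished_ids_in_completion_order, failed_ids).
    """
    waiting = deque(requests)
    running = {}  # req_id -> [remaining_tokens, blocks_held]
    free = total_blocks
    finished, failed = [], []
    while waiting or running:
        # Admission: strict FIFO, the head of the queue must fit in memory.
        while waiting:
            req_id, in_len, out_len = waiting[0]
            need = -(-(in_len + out_len) // block_size)  # ceiling division
            if need > total_blocks:
                failed.append(waiting.popleft()[0])  # can never fit
                continue
            if need > free:
                break  # wait until running requests release memory
            waiting.popleft()
            free -= need  # prefill reserves the KV cache blocks
            running[req_id] = [out_len, need]
        # Decode: each running request generates one token.
        for req_id in list(running):
            running[req_id][0] -= 1
            if running[req_id][0] <= 0:
                free += running.pop(req_id)[1]  # memory cleanup
                finished.append(req_id)
    return finished, failed


done, failed = simulate_fifo(
    [("a", 16, 16), ("b", 32, 16), ("c", 100, 100), ("d", 8, 4)],
    total_blocks=4,
)
print(done, failed)
```

Here request b has to wait for a to release its blocks, c can never fit and is reported as failed, and the short request d finishes before b once both are admitted.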
Performance Metrics
The simulator tracks multiple performance metrics:
- Latency: Total time from request arrival to completion
- Throughput: Requests processed per second
- Memory Utilization: Peak and average memory usage
- Hardware Utilization: Percentage of peak theoretical performance
- Queue Times: Time spent waiting vs processing
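As an illustration of how latency, queue time, and throughput relate, the sketch below computes them from per-request timestamps. The record keys (arrive_at, start_at, finish_at) and the summarize helper are hypothetical and do not describe the layout of stats.json.

```python
from statistics import mean


def summarize(records):
    """records: list of dicts with arrive_at, start_at, finish_at (seconds)."""
    latencies = [r["finish_at"] - r["arrive_at"] for r in records]
    queue_times = [r["start_at"] - r["arrive_at"] for r in records]
    makespan = (max(r["finish_at"] for r in records)
                - min(r["arrive_at"] for r in records))
    return {
        "mean_latency_s": mean(latencies),
        "mean_queue_time_s": mean(queue_times),
        "throughput_rps": len(records) / makespan,
    }


stats = summarize([
    {"arrive_at": 0.0, "start_at": 0.0, "finish_at": 2.0},
    {"arrive_at": 1.0, "start_at": 2.0, "finish_at": 4.0},
])
print(stats)
```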
Advanced Features
Memory Planning
The memory planner manages KV cache storage:
- Tracks memory blocks for KV cache storage
- Handles allocation failures gracefully
- Supports different precision levels (weights, activations, KV cache)
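To make the KV cache accounting concrete, the sketch below computes the cache footprint per token and allocates it in fixed-size blocks. BlockPlanner is a hypothetical stand-in for the simulator's memory planner, and the block size of 16 tokens is an assumption.

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, kv_bit=16):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * kv_bit // 8


class BlockPlanner:
    """Hypothetical fixed-size block allocator for KV cache storage."""

    def __init__(self, total_blocks, block_size=16):
        self.free_blocks = total_blocks
        self.block_size = block_size
        self.held = {}  # req_id -> number of blocks held

    def allocate(self, req_id, n_tokens):
        need = math.ceil(n_tokens / self.block_size)
        if need > self.free_blocks:
            return False  # caller decides whether to wait or fail the request
        self.free_blocks -= need
        self.held[req_id] = self.held.get(req_id, 0) + need
        return True

    def free(self, req_id):
        self.free_blocks += self.held.pop(req_id, 0)


# Llama-2-7B-style model: 32 layers, 32 KV heads, head_dim 128, 16-bit cache.
per_token = kv_bytes_per_token(32, 32, 128)  # 524288 bytes = 512 KiB
planner = BlockPlanner(total_blocks=4)
first = planner.allocate("a", 40)   # needs 3 blocks
second = planner.allocate("b", 32)  # needs 2 blocks, only 1 is free
planner.free("a")
```

Returning False instead of raising lets the caller keep the request in the waiting queue, which is one way an allocation failure can be handled without aborting the simulation.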
Batch Processing
During decode phase, multiple requests can be processed together:
- Dynamic batching based on memory availability
- Batch size affects compute efficiency
- Supports heterogeneous batch sizes
Trace Analysis
Generated traces can be loaded into Chrome's trace viewer (chrome://tracing) for detailed visual analysis of:
- Request timelines
- Engine utilization
- Memory usage patterns
- Concurrent execution
Example Use Cases
- Hardware Selection: Compare different GPU configurations for specific workloads
- Sizing Studies: Determine optimal engine count for target performance
- Bottleneck Analysis: Identify whether the system is memory-bound or compute-bound
- Capacity Planning: Estimate required resources for expected request patterns
- Algorithm Design: Evaluate impact of different batching strategies
Dependencies
The simulator requires several external libraries:
- transformers: For model configuration loading
- humanize: For formatting large numbers
- numpy: For numerical computations
- matplotlib/seaborn: For plotting roofline graphs
- rich: For enhanced console output
Configuration
Hardware parameters can be extended by modifying config/hardware_params.py. Each hardware platform should specify:
- Memory bandwidth in bytes/second
- Peak compute performance for relevant precision levels
- On-chip buffer size in bytes
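A new entry might look like the sketch below. The dictionary name, key names, and entry name are hypothetical, since the actual layout of config/hardware_params.py is not shown here. The bandwidth and peak values are A100 40GB datasheet figures, while the on-chip buffer size is a placeholder.

```python
hardware_params = {
    "nvidia_A100_40G": {        # hypothetical entry name
        "bandwidth": 1555e9,    # bytes/second
        "FP16": 312e12,         # dense FP16 tensor-core FLOPS
        "INT8": 624e12,         # dense INT8 tensor-core OPS
        "onchip_buffer": 20e6,  # bytes; placeholder, not a datasheet value
    },
}


def ridge_point(hw, precision="FP16"):
    """Arithmetic intensity (ops/byte) above which a kernel is compute-bound."""
    return hw[precision] / hw["bandwidth"]


ridge = ridge_point(hardware_params["nvidia_A100_40G"])
print(round(ridge, 1))
```

Keeping bandwidth in bytes/second and compute in operations/second means their ratio is directly the roofline ridge point in operations per byte.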
Limitations and Assumptions
- Inference timing is based on analytical models rather than actual execution
- Assumes static hardware parameters (no thermal throttling or frequency scaling)
- Simplified memory model (doesn’t account for fragmentation overhead)
- Network latency and I/O overhead are not modeled
- Request scheduling follows FIFO order with simple memory-based admission control