Run `flash deploy` to build and deploy your Flash application:
- Build: Packages your code, dependencies, and manifest.
- Upload: Sends the artifact to Runpod’s storage.
- Provision: Creates or updates Serverless endpoints.
- Configure: Sets up environment variables and service discovery.
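All four stages above run from a single command. A minimal invocation looks like this (run from your project root):

```shell
# Build, upload, provision, and configure in one step.
flash deploy
```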
Deployment architecture
Flash deploys your application as multiple independent Serverless endpoints. Each endpoint configuration in your worker files becomes a separate endpoint.

How Flash deployments work:

- One `Endpoint` class = one Serverless endpoint: Each unique endpoint configuration (defined by its `name` parameter) creates a separate Serverless endpoint with its own URL.
- Call any endpoint: After deployment, you can call whichever endpoint you need: `lb_worker` for API requests, `gpu_worker` for GPU tasks, `cpu_worker` for CPU tasks.
- Load balancing endpoints: Create HTTP APIs with custom routes using `.get()`, `.post()`, and other decorators.
- Queue-based endpoints: Run compute tasks using the `/runsync` or `/run` routes.
- Inter-endpoint communication: Endpoints can call each other's functions when needed, using the Runpod GraphQL service for discovery.
Deploy to a specific environment
Flash organizes deployments using apps and environments. Deploy to a specific environment using the `--env` flag:
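For example, to deploy to a development environment (the environment name `dev` here is illustrative; use whatever environments you have created):

```shell
# Deploy the current app to the "dev" environment.
flash deploy --env dev
```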
Post-deployment
After a successful deployment, Flash displays all deployed endpoints grouped by type.

Understanding endpoint architecture

The relationship between endpoint configurations and deployed endpoints differs between load-balanced and queue-based endpoints.
Queue-based endpoints (one function per endpoint)
For queue-based endpoints, each `@Endpoint` function must have its own unique name. This creates two separate Serverless endpoints:

- `https://api.runpod.ai/v2/abc123xyz` (run-model)
- `https://api.runpod.ai/v2/def456xyz` (preprocess)
Load-balanced endpoints (multiple routes per endpoint)
For load-balanced endpoints, you can define multiple HTTP routes on a single endpoint. This creates:

- One Serverless endpoint: `https://abc123xyz.api.runpod.ai` (named "api")
- Three HTTP routes: `POST /generate`, `POST /translate`, `GET /health`
Preview before deploying
Before pushing to production, you can test your deployment locally in Docker using the `--preview` flag:
- Builds your project (creates the deployment artifact and manifest).
- Creates a Docker network for inter-container communication.
- Starts one container per endpoint configuration (`lb_worker`, `gpu_worker`, `cpu_worker`, etc.).
- Exposes all endpoints for local testing.
Press Ctrl+C to stop the preview environment.
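A preview run looks like this (Docker must be running locally):

```shell
# Build the app and start it locally in Docker instead of deploying.
# One container starts per endpoint configuration; Ctrl+C tears them down.
flash deploy --preview
```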
Managing deployment size
Runpod Serverless has a 1.5 GB deployment limit. Flash automatically excludes packages that are pre-installed in the base image: `torch`, `torchvision`, `torchaudio`, `numpy`, and `triton`.

Use the `--exclude` flag to skip additional packages:
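For example (the package name is illustrative, and the exact flag syntax — repeated flags versus a comma-separated list — may differ from this sketch):

```shell
# Skip a large package that is already available in your runtime image.
flash deploy --exclude some-large-package
```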
Base image packages
| Configuration type | Base image | Auto-excluded packages |
|---|---|---|
| GPU (`gpu=`) | PyTorch base | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
| CPU (`cpu=`) | Python slim | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
| Load-balanced | Same as GPU/CPU | Same as GPU/CPU |
Build process
When you run `flash deploy` (or `flash build`), Flash:
- Discovers all `@Endpoint`-decorated functions.
- Groups functions by their endpoint name.
- Generates handler files for each endpoint.
- Creates a `flash_manifest.json` file for service discovery.
- Installs dependencies with Linux x86_64 compatibility.
- Packages everything into `.flash/artifact.tar.gz`.
Build artifacts
After building, these artifacts are created in the `.flash/` directory:
| Artifact | Description |
|---|---|
| `.flash/artifact.tar.gz` | Deployment package |
| `.flash/flash_manifest.json` | Service discovery configuration |
| `.flash/.build/` | Temporary build directory (removed by default) |
What gets deployed
When you deploy a Flash app, you're deploying a build artifact (tarball) onto pre-built Flash Docker images. This architecture is similar to AWS Lambda layers: the base runtime is pre-built, and your code and dependencies are layered on top.

The build artifact
The `.flash/artifact.tar.gz` file (max 1.5 GB) contains:

- `lb_worker.py`, `gpu_worker.py`, `cpu_worker.py` (generated handler files)
- `flash_manifest.json`
- `requirements.txt`
- Installed dependencies (`torch`, `transformers`, ...)
The deployment manifest
The `flash_manifest.json` file is the brain of your deployment. It tells each endpoint:
- Which functions to execute.
- What Docker image to use.
- How to configure resources (GPUs, workers, scaling).
- Environment variables for workers.
- How to route HTTP requests (for load balancer endpoints).
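A manifest covering those responsibilities might be shaped roughly like this. Every field name below is a guess for illustration only; inspect your own `.flash/flash_manifest.json` for the actual schema:

```json
{
  "endpoints": {
    "gpu_worker": {
      "functions": ["run_model"],
      "image": "runpod/flash-pytorch-base",
      "resources": { "gpu": "A100", "workers_max": 3 },
      "env": { "MODEL_NAME": "my-model" }
    },
    "api": {
      "type": "load_balanced",
      "routes": [
        { "method": "POST", "path": "/generate", "function": "generate" }
      ]
    }
  }
}
```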
What gets created on Runpod
For each endpoint configuration in the manifest, Flash creates an independent Serverless endpoint, identified by its `name` parameter.
Cross-endpoint communication
When one endpoint needs to call a function on another endpoint:

- Manifest lookup: The calling endpoint checks `flash_manifest.json` for the function-to-resource mapping.
- Service discovery: It queries the state manager (Runpod GraphQL API) for the target endpoint's URL.
- Direct call: It makes an HTTP request directly to the target endpoint.
- Response: The target endpoint executes the function and returns the result.
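The four steps above can be sketched in plain Python with the manifest and the discovery service mocked out. Flash performs these steps internally; all names and structures here are illustrative:

```python
# Stand-in for flash_manifest.json: function name -> owning endpoint.
MANIFEST = {
    "run_model": "gpu_worker",
}

# Stand-in for the Runpod GraphQL state manager: endpoint -> URL.
DISCOVERY = {
    "gpu_worker": "https://api.runpod.ai/v2/abc123xyz",
}

def call_remote(function_name, payload):
    # 1. Manifest lookup: which endpoint owns this function?
    endpoint = MANIFEST[function_name]
    # 2. Service discovery: resolve the target endpoint's URL.
    base_url = DISCOVERY[endpoint]
    # 3. Direct call: in real code, an HTTP POST to this URL.
    url = f"{base_url}/runsync"
    # 4. Response: here we just return what the request would look like.
    return {"url": url, "input": payload}

request = call_remote("run_model", {"prompt": "hello"})
```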
Calling another endpoint from your code
To call one endpoint from another, import the target endpoint function inside your function body. Flash automatically detects these imports and generates the necessary dispatch stubs. For example, a CPU worker (`cpu_worker.py`) can import an inference function from a GPU worker (`gpu_worker.py`).
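The pattern can be sketched in one self-contained file. Here `gpu_worker` is simulated as a module so the inner import actually runs; in a real project it would be a separate worker file, and Flash would replace the import with a remote dispatch stub. All names are illustrative:

```python
import sys
import types

# Simulate gpu_worker.py as an importable module for this sketch.
gpu_worker = types.ModuleType("gpu_worker")

def run_inference(text):
    # On Runpod, this would execute on the GPU endpoint.
    return f"inference result for: {text}"

gpu_worker.run_inference = run_inference
sys.modules["gpu_worker"] = gpu_worker

def preprocess_and_infer(raw):
    # The import lives *inside* the function body; Flash detects imports
    # like this and routes the call to the target endpoint.
    from gpu_worker import run_inference
    cleaned = raw.strip().lower()
    return run_inference(cleaned)
```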
Call deployed endpoints from scripts
After deploying your Flash app, you can call your `@Endpoint` functions directly from Python scripts. Flash automatically resolves the app context from your project structure, so in most cases you can run scripts without any additional configuration.
How it works
When you run a script that calls an `@Endpoint` function, Flash:
- Detects the app context from the project directory structure.
- Looks up the deployed endpoint by name within the resolved app and environment.
- Routes the request to that endpoint using Flash’s sentinel service.
- Returns the result to your script.
This lets you use your existing `@Endpoint` function definitions to interact with deployed endpoints without modifying your code.
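The script-side experience can be sketched with a stand-in decorator that simulates Flash's routing locally. In a deployed app, calling the decorated function would resolve the app/env context, look up the endpoint by name, and route the request through Flash's sentinel service; all names here are illustrative:

```python
# Stand-in decorator: simulates remote routing with a local wrapper.
def Endpoint(name):
    def wrap(fn):
        def call(*args, **kwargs):
            # Real Flash would: resolve app/env context, look up the
            # deployed endpoint by `name`, send the request, and return
            # the remote result. Here we just run the function locally.
            return {"endpoint": name, "result": fn(*args, **kwargs)}
        return call
    return wrap

@Endpoint(name="run-model")
def run_model(prompt: str) -> str:
    return f"generated: {prompt}"

# In a script, you call the decorated function like any other function:
response = run_model("a photo of a cat")
```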
Example: calling within the same script
The simplest approach is to call the endpoint directly in the same file where it's defined.

Example: importing from another script

You can also import and call endpoints from a separate script.

Override the resolved context
Flash resolves the app name from your project's directory structure. Use the `FLASH_APP` and `FLASH_ENV` environment variables to override this automatic resolution when needed.
A common use case is when you move a script to a different directory. Since the resolved app name depends on the directory location, moving the script changes the resolved context. To continue targeting the original app, set `FLASH_APP` explicitly:
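For example (the app name and script path are illustrative):

```shell
# Pin the script to the original app regardless of where it now lives.
FLASH_APP=my-app python scripts/call_endpoint.py
```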
Error without context
If Flash cannot resolve the app context and you haven't set the environment variables, it raises an error.

Automatic context in deployed workers
When Flash deploys your app, it automatically sets the `FLASH_APP` and `FLASH_ENV` environment variables on each worker. This enables cross-endpoint communication within your deployed application without additional configuration.
Troubleshooting
No @Endpoint functions found
If the build process can't find your endpoint functions:

- Ensure functions are decorated with `@Endpoint(...)`.
- Check that Python files aren't excluded by `.gitignore` or Flash's built-in ignore patterns.
- Verify decorator syntax is correct.
Deployment size limit exceeded
Base image packages are auto-excluded. If your deployment still exceeds 1.5 GB, use the `--exclude` flag to skip additional packages.
Authentication errors
Verify your API key is set correctly. Add it to a `.env` file or export it in your shell:
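For example, using the `RUNPOD_API_KEY` variable (the key value below is a placeholder):

```shell
# Option 1: store the key in a .env file in your project root.
echo "RUNPOD_API_KEY=your-api-key" >> .env

# Option 2: export it in the current shell session.
export RUNPOD_API_KEY=your-api-key
```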
Import errors in endpoint functions
Import packages inside the endpoint function, not at the top of the file.

Next steps

- Learn about apps and environments for managing deployments.
- View the CLI reference for all available commands.
- Configure hardware resources for your endpoints.
- Monitor and troubleshoot your deployments.