Simulate Apache Spark Workloads Without a Cluster Using Fauxspark
Posted October 9, 2025

FauxSpark is a tool for simulating Apache Spark workloads without a cluster.

The current version includes:
- DAG scheduling with stages, tasks, and dependencies (though perhaps designing around RDDs would have been the right call)
- Modeling input, output, and shuffle partition sizes as probability distributions (see the sketch after this list)
- Automatic retries on executor or shuffle-fetch failures
- Single-job execution
- A simple CLI to tweak cluster configuration, simulate failures, and scale executors up
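
To make the moving parts concrete, here is a minimal, self-contained sketch of how such a discrete-event simulation could work: stages with dependencies, task durations driven by partition sizes sampled from a distribution, and bounded retries on simulated failures. Everything in it (the stage names, the log-normal distribution, the failure and retry parameters) is an illustrative assumption, not FauxSpark's actual model or code.

```python
# Sketch of a discrete-event simulation of a Spark-like DAG.
# All names and parameters below are assumptions for illustration,
# not FauxSpark's real implementation.
import heapq
import random

random.seed(42)

# A stage becomes runnable only after its parent stages finish.
STAGES = {
    "map":     {"parents": [],          "tasks": 8},
    "shuffle": {"parents": ["map"],     "tasks": 4},
    "reduce":  {"parents": ["shuffle"], "tasks": 2},
}

EXECUTORS = 4        # simulated executor slots (assumed)
FAILURE_RATE = 0.05  # probability a task attempt fails (assumed)
MAX_RETRIES = 3      # bounded automatic retries on failure (assumed)

def task_duration():
    # Partition size drawn from a log-normal distribution (assumed),
    # with duration proportional to partition size.
    size_mb = random.lognormvariate(4.0, 0.5)  # ~55 MB median
    return size_mb / 20.0                      # 20 MB/s processing rate

def run():
    clock, events = 0.0, []        # event heap keyed by completion time
    done = {s: 0 for s in STAGES}  # finished task count per stage
    launched = set()               # stages whose tasks were submitted
    pending = []                   # queued (stage, attempt) tasks
    free = EXECUTORS

    def ready(stage):
        return all(done[p] == STAGES[p]["tasks"]
                   for p in STAGES[stage]["parents"])

    while len(launched) < len(STAGES) or events or pending:
        # Submit tasks for any stage whose parents have all finished.
        for s in STAGES:
            if s not in launched and ready(s):
                pending += [(s, 0)] * STAGES[s]["tasks"]
                launched.add(s)
        # Assign queued tasks to free executor slots.
        while free and pending:
            stage, attempt = pending.pop()
            free -= 1
            heapq.heappush(events, (clock + task_duration(), stage, attempt))
        # Jump the clock to the next task completion.
        clock, stage, attempt = heapq.heappop(events)
        free += 1
        if random.random() < FAILURE_RATE and attempt < MAX_RETRIES:
            pending.append((stage, attempt + 1))  # retry the failed attempt
        else:
            done[stage] += 1  # (after MAX_RETRIES, count it as done anyway)
    print(f"simulated job finished at t={clock:.1f}s")

run()
```

The heap-ordered event queue is what makes this discrete-event rather than tick-based: the clock jumps straight to the next task completion, so simulating hours of cluster time costs only as many steps as there are task attempts.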
This tool might be relevant to the following folks:
- Data & Infrastructure engineers running Apache Spark who want to experiment with cluster configurations
- Anyone curious about Spark internals
I'd appreciate feedback from anyone with experience in discrete-event simulation, particularly on the planned features, as well as from anyone who might find this useful and wants to help shape its development.
A walkthrough section in the README demonstrates how it can be used.
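
As a rough illustration of the kind of knobs the CLI bullet above implies, the snippet below sketches a hypothetical configuration. None of these keys come from FauxSpark; they only stand in for the features the list describes (executor count, failure injection, partition-size distributions), and the README walkthrough documents the real interface.

```python
# Hypothetical simulation knobs; illustrative assumptions only,
# not FauxSpark's actual configuration schema.
cluster_config = {
    "executors": 8,                 # number of simulated executors
    "cores_per_executor": 4,
    "executor_failure_rate": 0.02,  # inject random executor failures
    "shuffle_fetch_failure_rate": 0.01,
    # Partition sizes as probability distributions, not fixed values.
    "input_partition_mb": {"dist": "lognormal", "mu": 4.0, "sigma": 0.5},
    "shuffle_partition_mb": {"dist": "uniform", "low": 16, "high": 128},
}
```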
GitHub repo: https://github.com/fhalde/fauxspark