Ethereum Prysm client hit by mainnet incident: resource exhaustion leads to large-scale block and attestation losses
The Prysm team released a report on the mainnet incident stating that on December 4, during the Ethereum mainnet Fusaka period, nearly all Prysm beacon nodes experienced resource exhaustion while processing certain attestations. The nodes were unable to respond to validator requests in time, resulting in a large number of missed blocks and attestations.
The incident spanned epochs 411439 to 411480, a total of 42 epochs, during which 248 of 1344 slots missed their blocks, a miss rate of roughly 18.5%. Network participation briefly dropped to 75%, and validators lost about 382 ETH in attestation rewards. The root cause was that Prysm received attestations, likely from nodes out of sync with mainnet, that referenced block roots from a previous epoch. To verify them, Prysm repeatedly replayed old epoch states and ran costly epoch transitions, which under high concurrency exhausted node resources. The defect was introduced in Prysm PR 15965, which had been running on a testnet for a month without triggering the same scenario.
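For illustration only, the sketch below (in Go, Prysm's implementation language) shows how this kind of verification path can become expensive. All names and types here are hypothetical and do not reflect Prysm's actual code; the point is that an attestation targeting a prior-epoch block root forces a per-attestation state replay, and many such attestations arriving concurrently multiply that cost.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types for illustration only; these are not Prysm's actual
// data structures.
type Attestation struct {
	TargetEpoch uint64
	TargetRoot  string
}

type BeaconState struct {
	Epoch uint64
	Roots map[string]bool // block roots visible from this state
}

// replayToEpoch stands in for the expensive operation the report describes:
// regenerating an older beacon state by replaying blocks and running the
// epoch transition. The sleep is only a placeholder for that cost.
func replayToEpoch(epoch uint64) *BeaconState {
	time.Sleep(50 * time.Millisecond)
	return &BeaconState{Epoch: epoch, Roots: map[string]bool{}}
}

// verifyAttestation sketches the problematic path: when an attestation
// targets a block root from an earlier epoch, the node rebuilds that
// epoch's state just to check the root. Under high concurrency this
// can exhaust CPU and memory.
func verifyAttestation(head *BeaconState, att Attestation) bool {
	if att.TargetEpoch < head.Epoch {
		old := replayToEpoch(att.TargetEpoch) // expensive per attestation
		return old.Roots[att.TargetRoot]
	}
	return head.Roots[att.TargetRoot]
}

func main() {
	head := &BeaconState{Epoch: 411440, Roots: map[string]bool{"0xabc": true}}
	att := Attestation{TargetEpoch: 411439, TargetRoot: "0xdef"}
	fmt.Println("attestation valid:", verifyAttestation(head, att))
}
```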
The team's temporary mitigation was to enable the --disable-last-epoch-target flag on v7.0.0; the subsequently released v7.0.1 and v7.1.0 include a long-term fix that validates attestations against the head state instead of replaying historical states. Prysm said the issue gradually eased after 4:45 UTC on December 4, and by epoch 411480 network participation had recovered to above 95%.
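By contrast, here is a minimal sketch of the idea behind the long-term fix as the report describes it, again using hypothetical types rather than Prysm's actual code: the attestation's target is checked against information the head state already holds, so no historical state has to be regenerated per attestation.

```go
package main

import "fmt"

// Hypothetical types for illustration only; not Prysm's actual structures.
type Attestation struct {
	TargetEpoch uint64
	TargetRoot  string
}

type HeadState struct {
	Epoch       uint64
	RecentRoots map[string]bool // block roots the head state already tracks
}

// verifyWithHeadState sketches the fix's approach: check the attestation's
// target against data available in the head state, avoiding a costly
// replay of historical epoch states.
func verifyWithHeadState(head *HeadState, att Attestation) bool {
	return head.RecentRoots[att.TargetRoot]
}

func main() {
	head := &HeadState{
		Epoch:       411480,
		RecentRoots: map[string]bool{"0xabc": true},
	}
	att := Attestation{TargetEpoch: 411479, TargetRoot: "0xabc"}
	fmt.Println("attestation valid:", verifyWithHeadState(head, att))
}
```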
The Prysm team noted that the incident underscores the importance of client diversity: if a single client exceeds one-third of the network, a bug in it can temporarily halt finality; above two-thirds, it risks finalizing an invalid chain. The team also acknowledged that communication around feature flags was unclear and that its test environments did not simulate large numbers of out-of-sync nodes, and said it will improve its testing strategy and configuration management going forward.
