mongo/docs/testing/network_fault_injection_mongobridge.md
Steve McClure 32e8f260de SERVER-124136 Format markdown via prettier: wrap lines and use width of 100 (#52231)
GitOrigin-RevId: 3305c1e2ee3a6a2c3a5b2b7883b0f491a59ed646
2026-04-21 19:20:11 +00:00

213 lines
6.3 KiB
Markdown

# Network Fault Injection Framework (mongobridge)
## Overview
[Mongobridge](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L1)
is a network fault injection testing tool that allows test authors to intentionally simulate network
issues such as connection failures, message delays, or packet loss during communication to any node
in a cluster. It acts as a transparent proxy between MongoDB processes and their clients, enabling
controlled network fault injection for testing distributed system behavior.
## How It Works
When `ReplSetTest` or `ShardingTest` are instructed to use `mongobridge`, they will
[set up a mongobridge process](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/jstests/libs/replsettest.js#L2962)
for each node that
[creates a ProxiedConnection](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L323-L324)
between the node and any clients (including other nodes in the cluster) attempting to communicate
with it. When test authors send a command to a node, mongobridge
[intercepts the command and applies any configured actions](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L395-L430)
onto the commands before forwarding the command along to the node itself. This allows simple fault
injection from the test author's perspective.
## Quick Start
To use mongobridge in your tests:
1. **Enable mongobridge** in your test setup:
```javascript
let st = new ShardingTest({
shards: {rs0: {nodes: 2}},
mongos: 1,
config: 1,
useBridge: true, // Enable mongobridge
});
```
- **Test commands must be enabled**: Mongobridge's `*From` commands require
`enableTestCommands: true` (which is the default in test environments)
2. **Inject network faults** using bridge commands:
```javascript
// Delay messages by 5 seconds
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 5000);
// Reject all connections
st.rs0.getPrimary().rejectConnectionsFrom(st.rs0.getSecondary());
// Restore normal behavior
st.rs0.getPrimary().acceptConnectionsFrom(st.rs0.getSecondary());
```
3. Operations that depend on communication between the affected nodes will fail or timeout as
expected.
## What to keep in mind
Be aware that there are consequences to injecting network faults between nodes that can cause
downstream impact in (for example) heartbeats, sync source selection, and SDAM, and so after a fault
has been injected the test may not be in the state you expect it to be in for future commands. It is
best to keep mongobridge tests relatively short and targeted to ensure that flakiness due to these
faults doesn't impact the rest of your testing.
## Command Reference
Mongobridge supports four commands for network fault injection:
### `acceptConnectionsFrom(bridges)`
**Purpose**: Allows normal communication from specified sources
**Usage**:
```javascript
node.acceptConnectionsFrom(otherNode);
node.acceptConnectionsFrom([node1, node2, node3]); // Multiple nodes
```
**Effect**: Restores normal message forwarding (default state)
### `rejectConnectionsFrom(bridges)`
**Purpose**: Immediately closes connections from specified sources
**Usage**:
```javascript
node.rejectConnectionsFrom(otherNode);
```
**Effect**: New connections are rejected, existing connections are closed when a new request is sent
over them
**Use case**: Simulating complete network partitions
### `delayMessagesFrom(bridges, delayMs)`
**Purpose**: Delays message forwarding by specified milliseconds
**Usage**:
```javascript
node.delayMessagesFrom(otherNode, 5000); // 5 second delay
node.delayMessagesFrom(otherNode, 0); // Remove delay
```
**Parameters**:
- `delayMs`: Delay in milliseconds (0 to disable)
**Use case**: Simulating slow networks or testing timeout behavior
### `discardMessagesFrom(bridges, lossProbability)`
**Purpose**: Randomly discards messages with specified probability
**Usage**:
```javascript
node.discardMessagesFrom(otherNode, 0.5); // Drop 50% of messages
node.discardMessagesFrom(otherNode, 1.0); // Drop all messages
node.discardMessagesFrom(otherNode, 0.0); // Drop no messages
```
**Parameters**:
- `lossProbability`: Number between 0.0 (no loss) and 1.0 (total loss)
**Use case**: Simulating unreliable networks or packet loss
## Examples
### Basic Network Partition Test
```javascript
assert.eq(jsTest.options().enableTestCommands, true);
// Set up a replica set with mongobridge
let rst = new ReplSetTest({
nodes: 3,
useBridge: true,
settings: {electionTimeoutMillis: 2000, heartbeatIntervalMillis: 400},
});
rst.startSet();
rst.initiate();
// Partition the primary from secondaries
let primary = rst.getPrimary();
let secondaries = rst.getSecondaries();
primary.rejectConnectionsFrom(secondaries);
// Verify primary steps down due to lost majority
assert.soon(() => {
return rst.getPrimary() !== primary;
});
// Restore network
primary.acceptConnectionsFrom(secondaries);
rst.stopSet();
```
### Write Concern Timeout Test
```javascript
assert.eq(jsTest.options().enableTestCommands, true);
let st = new ShardingTest({
shards: {rs0: {nodes: 2}},
useBridge: true,
});
// Delay replication to cause write concern timeout
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 10000);
// This write should fail due to timeout
assert.commandFailed(
st.s0.getCollection("test.coll").insert(
{x: 1},
{
writeConcern: {w: 2, wtimeout: 5000},
},
),
);
// Restore normal replication
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 0);
st.stop();
```
### Simulating Packet Loss
```javascript
// Set up unreliable network with 30% packet loss
primary.discardMessagesFrom(secondary, 0.3);
// Operations may succeed or fail unpredictably
// Useful for testing retry logic and resilience
```
### Limitations
- **OP_QUERY exhaust**: Not supported for legacy exhaust queries (OP_MSG exhaust cursors are
supported)
- **Direct connections**: Only works when connections go through the bridge proxy
- **TLS support**: Mongobridge is not supported if the cluster is using TLS.
## See Also
- [mongobridge.js test example](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/jstests/noPassthrough/mongobridge/mongobridge.js#L1)