# Network Fault Injection Framework (mongobridge)

## Overview

[Mongobridge](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L1)
is a network fault injection testing tool that lets test authors intentionally simulate network
issues such as connection failures, message delays, or packet loss in communication with any node
in a cluster. It acts as a transparent proxy between MongoDB processes and their clients, enabling
controlled network fault injection for testing distributed system behavior.

## How It Works

When `ReplSetTest` or `ShardingTest` is instructed to use `mongobridge`, it will
[set up a mongobridge process](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/jstests/libs/replsettest.js#L2962)
for each node. Each mongobridge process
[creates a ProxiedConnection](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L323-L324)
between the node and any clients (including other nodes in the cluster) attempting to communicate
with it. When a command is sent to a node, mongobridge
[intercepts the command and applies any configured actions](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/src/mongo/tools/mongobridge_tool/bridge.cpp#L395-L430)
to it before forwarding it to the node itself. This keeps fault injection simple from the test
author's perspective.

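The intercept-then-forward behavior described above can be pictured with a small conceptual model. This is only a sketch: the real mongobridge is a C++ proxy, and `BridgeModel`, its methods, the action shapes, and the host names are all hypothetical names chosen for illustration.

```javascript
// Conceptual model only: the real mongobridge is a C++ proxy, and these
// class, method, and field names are hypothetical.
class BridgeModel {
    constructor(randomFn = Math.random) {
        this.actions = new Map(); // source host -> configured action
        this.randomFn = randomFn;
    }

    // Record the action to apply to traffic arriving from `host`.
    configure(host, action) {
        this.actions.set(host, action);
    }

    // Decide what happens to one message from `host` before it reaches the node.
    handle(host, message) {
        const action = this.actions.get(host) || {type: "accept"};
        switch (action.type) {
            case "reject":
                return {outcome: "connectionClosed"};
            case "discard":
                return this.randomFn() < action.lossProbability
                    ? {outcome: "dropped"}
                    : {outcome: "forwarded", message, delayMs: 0};
            case "delay":
                return {outcome: "forwarded", message, delayMs: action.delayMs};
            default:
                return {outcome: "forwarded", message, delayMs: 0};
        }
    }
}

const bridge = new BridgeModel(() => 0.25); // fixed "random" value for repeatability
bridge.configure("secondary:27018", {type: "delay", delayMs: 5000});
console.log(bridge.handle("secondary:27018", {ping: 1}).delayMs); // 5000
console.log(bridge.handle("client:50000", {ping: 1}).outcome); // forwarded
```

The point of the model is that the action is chosen per source host, which is why the bridge commands take the source nodes as their first argument.
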
## Quick Start

To use mongobridge in your tests:

1. **Enable mongobridge** in your test setup:

   ```javascript
   let st = new ShardingTest({
       shards: {rs0: {nodes: 2}},
       mongos: 1,
       config: 1,
       useBridge: true, // Enable mongobridge
   });
   ```

   - **Test commands must be enabled**: Mongobridge's `*From` commands require
     `enableTestCommands: true` (which is the default in test environments)

2. **Inject network faults** using bridge commands:

   ```javascript
   // Delay messages from the secondary by 5 seconds
   st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 5000);

   // Reject all connections from the secondary
   st.rs0.getPrimary().rejectConnectionsFrom(st.rs0.getSecondary());

   // Restore normal behavior
   st.rs0.getPrimary().acceptConnectionsFrom(st.rs0.getSecondary());
   ```

3. Operations that depend on communication between the affected nodes will fail or time out as
   expected.

## What to Keep in Mind

Injecting network faults between nodes can have downstream effects on, for example, heartbeats,
sync source selection, and SDAM. After a fault has been injected, the cluster may not be in the
state you expect for subsequent commands. It is best to keep mongobridge tests relatively short
and targeted so that flakiness caused by these faults doesn't affect the rest of your testing.

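One way to keep a fault short-lived is to scope it to a single block of assertions and always undo it afterwards. The sketch below is a hypothetical helper (`withFault` is not part of the test libraries), shown with a stub object in place of a real bridged node so that it runs outside the test shell.

```javascript
// Hypothetical helper: applies a fault, runs the test body, and always
// restores the network, even if the body throws.
function withFault(applyFault, restoreFault, body) {
    applyFault();
    try {
        return body();
    } finally {
        restoreFault();
    }
}

// Stand-in for a bridged node so the sketch runs outside the test shell.
const calls = [];
const fakePrimary = {
    rejectConnectionsFrom: (other) => calls.push(`reject ${other}`),
    acceptConnectionsFrom: (other) => calls.push(`accept ${other}`),
};

withFault(
    () => fakePrimary.rejectConnectionsFrom("secondary"),
    () => fakePrimary.acceptConnectionsFrom("secondary"),
    () => calls.push("assertions run here"),
);
console.log(calls.join(" -> "));
// reject secondary -> assertions run here -> accept secondary
```

In a real test, the two callbacks would call the bridge commands on actual nodes. Restoring the network does not undo side effects such as elections that happened while the fault was active.
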
## Command Reference

Mongobridge supports four commands for network fault injection:

### `acceptConnectionsFrom(bridges)`

**Purpose**: Allows normal communication from the specified sources

**Usage**:

```javascript
node.acceptConnectionsFrom(otherNode);
node.acceptConnectionsFrom([node1, node2, node3]); // Multiple nodes
```

**Effect**: Restores normal message forwarding (the default state)

### `rejectConnectionsFrom(bridges)`

**Purpose**: Rejects connections from the specified sources

**Usage**:

```javascript
node.rejectConnectionsFrom(otherNode);
```

**Effect**: New connections are rejected; existing connections are closed when a new request is
sent over them

**Use case**: Simulating complete network partitions

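The `From` suffix suggests that each command only configures traffic arriving at the node it is called on. Assuming that reading is correct, a partition that blocks both directions needs one call on each side. The sketch below uses a hypothetical helper, `isolateBothWays`, with stub nodes in place of real bridged connections.

```javascript
// Hypothetical helper; assumes rejectConnectionsFrom only affects traffic
// arriving at the node it is called on.
function isolateBothWays(nodeA, nodeB) {
    nodeA.rejectConnectionsFrom(nodeB);
    nodeB.rejectConnectionsFrom(nodeA);
}

// Stub nodes that record which peers they reject, so the sketch runs anywhere.
function makeStubNode(name) {
    return {
        name,
        rejected: new Set(),
        rejectConnectionsFrom(other) {
            this.rejected.add(other.name);
        },
    };
}

const stubA = makeStubNode("a");
const stubB = makeStubNode("b");
isolateBothWays(stubA, stubB);
console.log(stubA.rejected.has("b"), stubB.rejected.has("a")); // true true
```
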
### `delayMessagesFrom(bridges, delayMs)`

**Purpose**: Delays message forwarding by the specified number of milliseconds

**Usage**:

```javascript
node.delayMessagesFrom(otherNode, 5000); // 5 second delay
node.delayMessagesFrom(otherNode, 0); // Remove delay
```

**Parameters**:

- `delayMs`: Delay in milliseconds (0 to disable)

**Use case**: Simulating slow networks or testing timeout behavior

### `discardMessagesFrom(bridges, lossProbability)`

**Purpose**: Randomly discards messages with the specified probability

**Usage**:

```javascript
node.discardMessagesFrom(otherNode, 0.5); // Drop 50% of messages
node.discardMessagesFrom(otherNode, 1.0); // Drop all messages
node.discardMessagesFrom(otherNode, 0.0); // Drop no messages
```

**Parameters**:

- `lossProbability`: Number between 0.0 (no loss) and 1.0 (total loss)

**Use case**: Simulating unreliable networks or packet loss

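When picking a `lossProbability`, it helps to estimate how often a retried message would still get through. If each message is dropped independently with probability p (an assumption made here for illustration, not something this document states about mongobridge), the chance that at least one of n sends is delivered is 1 - p^n. The function name below is hypothetical.

```javascript
// Probability that at least one of `attempts` independent sends is delivered
// when each send is dropped with probability `lossProbability`.
// Assumes drops are independent of each other.
function deliveryProbability(lossProbability, attempts) {
    if (lossProbability < 0 || lossProbability > 1) {
        throw new RangeError("lossProbability must be between 0.0 and 1.0");
    }
    return 1 - Math.pow(lossProbability, attempts);
}

console.log(deliveryProbability(0.5, 3)); // 0.875
console.log(deliveryProbability(1.0, 10)); // 0
console.log(deliveryProbability(0.0, 1)); // 1
```
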
## Examples

### Basic Network Partition Test

```javascript
// Bridge commands require test commands to be enabled
assert.eq(jsTest.options().enableTestCommands, true);

// Set up a replica set with mongobridge
let rst = new ReplSetTest({
    nodes: 3,
    useBridge: true,
    settings: {electionTimeoutMillis: 2000, heartbeatIntervalMillis: 400},
});
rst.startSet();
rst.initiate();

// Partition the primary from secondaries
let primary = rst.getPrimary();
let secondaries = rst.getSecondaries();
primary.rejectConnectionsFrom(secondaries);

// Verify primary steps down due to lost majority
assert.soon(() => {
    return rst.getPrimary() !== primary;
});

// Restore network
primary.acceptConnectionsFrom(secondaries);

rst.stopSet();
```

### Write Concern Timeout Test

```javascript
// Bridge commands require test commands to be enabled
assert.eq(jsTest.options().enableTestCommands, true);

let st = new ShardingTest({
    shards: {rs0: {nodes: 2}},
    useBridge: true,
});

// Delay replication by 10 seconds, longer than the 5 second wtimeout below
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 10000);

// This write should fail because w: 2 cannot be satisfied within wtimeout
assert.commandFailed(
    st.s0.getCollection("test.coll").insert(
        {x: 1},
        {
            writeConcern: {w: 2, wtimeout: 5000},
        },
    ),
);

// Restore normal replication
st.rs0.getPrimary().delayMessagesFrom(st.rs0.getSecondary(), 0);

st.stop();
```

### Simulating Packet Loss

```javascript
// `primary` and `secondary` are nodes of a ReplSetTest started with useBridge: true

// Set up unreliable network with 30% packet loss
primary.discardMessagesFrom(secondary, 0.3);

// Operations may succeed or fail unpredictably
// Useful for testing retry logic and resilience
```

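The retry logic such a test exercises might look like the sketch below. Both `retry` and `flakyInsert` are hypothetical names. The flaky operation is a deterministic stand-in for a command sent over a lossy link, so the example behaves the same on every run.

```javascript
// Hypothetical retry helper: calls `operation` until it succeeds or
// `maxAttempts` is reached, then rethrows the last error.
function retry(operation, maxAttempts) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return {result: operation(), attempts: attempt};
        } catch (e) {
            lastError = e;
        }
    }
    throw lastError;
}

// Stand-in for an operation over a lossy link: fails twice, then succeeds.
let remainingFailures = 2;
function flakyInsert() {
    if (remainingFailures > 0) {
        remainingFailures--;
        throw new Error("network error");
    }
    return {ok: 1};
}

const outcome = retry(flakyInsert, 5);
console.log(outcome.attempts); // 3
```

With real packet loss the number of attempts varies from run to run, so assertions in such tests are better written against eventual success than against an exact attempt count.
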
## Limitations

- **OP_QUERY exhaust**: Not supported for legacy exhaust queries (OP_MSG exhaust cursors are
  supported)
- **Direct connections**: Fault injection only applies to connections that go through the bridge
  proxy
- **TLS support**: Mongobridge is not supported if the cluster is using TLS

## See Also

- [mongobridge.js test example](https://github.com/mongodb/mongo/blob/e810af1916caaedb1cde8d1e1b74bb50b2461daf/jstests/noPassthrough/mongobridge/mongobridge.js#L1)