Dear All,
I am currently working purely with Serf (without Consul) and Vivaldi network coordinates. My coordinate configuration values are as follows:
func DefaultConfig() *Config {
	return &Config{
		Dimensionality:       2,
		VivaldiErrorMax:      1.5,
		VivaldiCE:            0.25,
		VivaldiCC:            0.25,
		AdjustmentWindowSize: 20,
		HeightMin:            10.0e-6,
		LatencyFilterSize:    3,
		GravityRho:           150.0,
	}
}
I need to test whether the nodes drift away from the origin indefinitely, as described in the Vivaldi paper. Serf, however, introduced “GravityRho” as a countermeasure to this drift. I have the following code to check whether it is working:
package main

import (
	"log"
	"os"
	"time"

	"github.com/hashicorp/serf/client"
	"github.com/hashicorp/serf/coordinate"
)

const (
	serfRPCAddr  = "127.0.0.1:7373"
	samplePeriod = 1 * time.Minute
)

func main() {
	// 1. Initialize logging
	nodeLogFile, err := os.OpenFile("serf_node_drift.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer nodeLogFile.Close()
	nodeLogger := log.New(nodeLogFile, "", log.LstdFlags)

	systemLogFile, err := os.OpenFile("serf_system_drift.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer systemLogFile.Close()
	systemLogger := log.New(systemLogFile, "", log.LstdFlags)

	// 2. Connect to Serf
	serfClient, err := client.ClientFromConfig(&client.Config{Addr: serfRPCAddr})
	if err != nil {
		log.Fatal(err)
	}
	defer serfClient.Close()

	// 3. Create origin coordinate EXACTLY as Serf does internally.
	// NOTE: config.Dimensionality must match what the agents run with;
	// DistanceTo panics when the dimensionalities differ.
	config := coordinate.DefaultConfig()
	origin := coordinate.NewCoordinate(config)
	// Zero the position but maintain Serf's origin Height (config.HeightMin)
	for i := range origin.Vec {
		origin.Vec[i] = 0.0
	}
	origin.Adjustment = 0.0

	// 4. Main monitoring loop
	ticker := time.NewTicker(samplePeriod)
	defer ticker.Stop()
	for range ticker.C {
		members, err := serfClient.Members()
		if err != nil {
			log.Printf("Member error: %v", err)
			continue
		}

		var (
			maxDrift    float64
			totalDrift  float64
			activeNodes int
			vecSum      = make([]float64, config.Dimensionality)
			heightSum   float64
			adjustSum   float64
		)

		for _, member := range members {
			if member.Status != "alive" {
				continue
			}
			coord, err := serfClient.GetCoordinate(member.Name)
			if err != nil || coord == nil {
				continue
			}

			// 5. Calculate TRUE drift using Serf's actual method
			drift := coord.DistanceTo(origin).Seconds() * 1000 // ms

			// Log individual node drift
			nodeLogger.Printf("NODE_DRIFT node=%s drift_ms=%.2f vec=%v height=%.6f adj=%.6f",
				member.Name, drift, coord.Vec, coord.Height, coord.Adjustment)

			// Update metrics
			if drift > maxDrift {
				maxDrift = drift
			}
			totalDrift += drift
			activeNodes++

			// Accumulate components for true centroid calculation
			for i := range coord.Vec {
				vecSum[i] += coord.Vec[i]
			}
			heightSum += coord.Height
			adjustSum += coord.Adjustment
		}

		// 6. Calculate system metrics
		if activeNodes > 0 {
			n := float64(activeNodes)
			avgDrift := totalDrift / n

			// Calculate TRUE centroid including all components
			centroidVec := make([]float64, config.Dimensionality)
			for i := range vecSum {
				centroidVec[i] = vecSum[i] / n
			}
			centroidHeight := heightSum / n
			centroidAdjust := adjustSum / n

			// Construct centroid coordinate EXACTLY like real nodes
			centroidCoord := &coordinate.Coordinate{
				Vec:        centroidVec,
				Height:     centroidHeight,
				Adjustment: centroidAdjust,
				Error:      config.VivaldiErrorMax, // Not used in drift calc
			}

			// Calculate centroid drift using Serf's actual distance method
			centroidDrift := centroidCoord.DistanceTo(origin).Seconds() * 1000

			// Log metrics
			systemLogger.Printf("SYSTEM_DRIFT nodes=%d max_ms=%.2f avg_ms=%.2f centroid_ms=%.2f",
				activeNodes, maxDrift, avgDrift, centroidDrift)
		}
	}
}
This code does the following:
- Create an “origin” coordinate (0,0 position) just like Serf internally defines it.
- Every minute:
  - Get the list of alive Serf members.
  - For each alive node:
    - Get its current Vivaldi coordinate.
    - Calculate the distance (drift) from the origin.
    - Log its drift.
  - After checking all nodes:
    - Calculate max drift, average drift, and the centroid drift.
I see the following results after running this for about 3 days:
2025/04/24 13:39:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.38 centroid_ms=14.28
2025/04/24 13:40:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.35 centroid_ms=14.33
2025/04/24 13:41:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.12 centroid_ms=14.16
2025/04/24 13:42:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.11 centroid_ms=14.18
2025/04/24 13:43:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.16 centroid_ms=14.19
2025/04/24 13:44:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.19 centroid_ms=14.22
2025/04/24 13:45:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.22 centroid_ms=14.32
2025/04/24 13:46:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.38 centroid_ms=14.41
2025/04/24 13:47:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.42 centroid_ms=14.45
2025/04/24 13:48:05 SYSTEM_DRIFT nodes=162 max_ms=58.00 avg_ms=21.40 centroid_ms=14.26
<some entries have been removed to save space>
2025/04/28 08:29:05 SYSTEM_DRIFT nodes=162 max_ms=131.76 avg_ms=96.10 centroid_ms=95.47
2025/04/28 08:30:05 SYSTEM_DRIFT nodes=162 max_ms=131.76 avg_ms=96.05 centroid_ms=95.41
2025/04/28 08:31:05 SYSTEM_DRIFT nodes=162 max_ms=131.76 avg_ms=96.05 centroid_ms=95.43
2025/04/28 08:32:05 SYSTEM_DRIFT nodes=162 max_ms=131.76 avg_ms=96.17 centroid_ms=95.55
2025/04/28 08:33:05 SYSTEM_DRIFT nodes=162 max_ms=131.76 avg_ms=96.05 centroid_ms=95.43
2025/04/28 08:34:05 SYSTEM_DRIFT nodes=162 max_ms=130.67 avg_ms=96.05 centroid_ms=95.41
2025/04/28 08:35:05 SYSTEM_DRIFT nodes=162 max_ms=130.67 avg_ms=96.08 centroid_ms=95.44
2025/04/28 08:36:05 SYSTEM_DRIFT nodes=162 max_ms=130.67 avg_ms=96.06 centroid_ms=95.43
2025/04/28 08:37:05 SYSTEM_DRIFT nodes=162 max_ms=130.67 avg_ms=95.85 centroid_ms=95.21
2025/04/28 08:38:05 SYSTEM_DRIFT nodes=162 max_ms=130.67 avg_ms=95.86 centroid_ms=95.22
According to these results, max_ms, avg_ms, and centroid_ms all keep increasing steadily: over the logged period, centroid_ms rose from about 14 ms to about 95 ms, avg_ms from about 21 ms to about 96 ms, and max_ms from 58 ms to about 131 ms. Is this normal behavior, or is drift happening here? Also, am I tracking it correctly with this code?
Thank you for any advice and help!