5. Comparative Experiment

5.1 Comparison with UniformLoadShedder + LeastLongTermMessageRate

5.1.1 Heterogeneous Environment

The same heterogeneous environment and load pressure as in the earlier UniformLoadShedder + LeastLongTermMessageRate test were used, where machine XXX.34 is the heterogeneous one, significantly more powerful than the other three machines.

It can be observed that the traffic throughput and message rate of machine XXX.34 are significantly higher than those of the other machines.

[Figure: maximum-to-minimum ratio of throughput]
[Figure: maximum-to-minimum ratio of message rates]

The maximum-to-minimum ratio of message rates and traffic throughput even reached 11. However, this is quite reasonable: when examining the resource utilization, it turns out that the load on machine XXX.34 remains the lowest!

It can be seen that the resource utilization of XXX.34 is still less than half that of the other machines. One might hope that the load of other machines, such as XXX.83, could be shifted further to XXX.34 to achieve more balanced resource utilization. However, the current AvgShedder algorithm cannot achieve this level of optimization.

5.1.2 Load Fluctuation

We deployed periodically fluctuating consumption tasks:

It can be seen that no bundle unload was triggered at all, which shows good stability.

5.2 Comparison with ThresholdShedder + LeastResourceUsageWithWeight

Next, we deployed the same test environment as in the ThresholdShedder + LeastResourceUsageWithWeight test, with homogeneous machines, to compare the results.

Three pressure-testing tasks were launched as follows:

Upon starting the pressure-testing tasks, it is evident that among the brokers, XXX.83 (the green line) handled the heaviest traffic and exhibited the highest CPU utilization, whereas XXX.161 (the blue line) processed the lightest traffic and had the lowest CPU utilization. The score difference between them was 63 - 38.5 = 24.5 > 15. Therefore, after 8 consecutive checks (i.e., after waiting 8 minutes), load balancing was triggered, and XXX.83 and XXX.161 shared the traffic evenly.
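To make this trigger condition concrete, here is a minimal sketch of the double-threshold, hit-count mechanism. The class and field names are hypothetical, and the logic is a simplified reading of the behavior described in this article, not the actual Pulsar implementation:

```java
// Simplified sketch of AvgShedder's trigger condition as described above.
// Names and structure are illustrative, not the actual Pulsar source.
public class AvgShedderTriggerSketch {
    private static final double LOW_THRESHOLD = 15;   // moderate score gap
    private static final int HIT_COUNT_LOW = 8;       // consecutive checks required
    private static final double HIGH_THRESHOLD = 40;  // large gap, e.g. after scaling up
    private static final int HIT_COUNT_HIGH = 2;

    private int lowHits = 0;
    private int highHits = 0;

    /** Called once per load-report cycle (roughly once a minute). */
    public boolean shouldUnload(double maxScore, double minScore) {
        double diff = maxScore - minScore;
        lowHits = diff > LOW_THRESHOLD ? lowHits + 1 : 0;
        highHits = diff > HIGH_THRESHOLD ? highHits + 1 : 0;
        // In this test: 63 - 38.5 = 24.5 > 15, so the unload fired on the
        // 8th consecutive check. A short-lived spike resets the counters
        // before either requirement is met, so it never triggers an unload.
        return highHits >= HIT_COUNT_HIGH || lowHits >= HIT_COUNT_LOW;
    }
}
```

The counter reset is what gives this mechanism its tolerance to short-lived load fluctuations, as the next observations show.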

Only a single bundle unload was needed for the cluster to reach a stable state, whereas ThresholdShedder + LeastResourceUsageWithWeight took 22 minutes and performed many incorrect bundle unloads.

[Figure: maximum-to-minimum ratio of throughput]
[Figure: maximum-to-minimum ratio of message rates]

Following the load-balancing operation, the maximum-to-minimum ratio of message rates and traffic throughput within the cluster decreased from 2.5 to 1.5, a favorable outcome.

Additionally, we observed a load fluctuation where the CPU usage of XXX.32 suddenly spiked to 86.5 and then quickly dropped back down, while its traffic throughput remained unchanged. This could potentially be attributed to other processes deployed on the machine. Regardless of the cause, however, a load-balancing algorithm ought not to initiate a bundle unload for such a momentary spike. AvgShedder indeed did not, further demonstrating its ability to handle load fluctuations.
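In terms of the trigger sketch shown earlier, such a spike widens the score gap for only a single check, so the hit counters reset before either threshold's count is reached. A minimal demonstration with illustrative scores, reusing the hypothetical AvgShedderTriggerSketch class:

```java
// Hypothetical usage of AvgShedderTriggerSketch: a one-check spike is
// absorbed because the hit counters reset on the next check.
public class FluctuationDemo {
    public static void main(String[] args) {
        AvgShedderTriggerSketch trigger = new AvgShedderTriggerSketch();
        // Spike: gap = 86.5 - 40 = 46.5, but only one hit is recorded.
        System.out.println(trigger.shouldUnload(86.5, 40.0)); // false
        // Next check the load is back to normal, so both counters reset.
        System.out.println(trigger.shouldUnload(45.0, 40.0)); // false
    }
}
```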

5.2.1 Broker Scaling Down

Broker XXX.161 (blue line) was taken offline to observe the changes in cluster traffic.

It can be seen that the traffic offloaded from XXX.161 (blue line) was distributed to XXX.83 (green line), XXX.87 (orange line), and XXX.32 (red line), while XXX.206 (yellow line) did not receive any new traffic. This distribution result can also be seen in the logs:

```
2024-06-18T15:04:32,188+0800 [pulsar-2-18] DEBUG org.apache.pulsar.broker.loadbalance.impl.AvgShedder - expected broker:XXX.161:8081 is shutdown, candidates:[XXX.83:8081, XXX.32:8081, XXX.87:8081, XXX.206:8081]
2024-06-18T15:04:32,188+0800 [pulsar-2-18] DEBUG org.apache.pulsar.broker.loadbalance.impl.AvgShedder - Assignment details: brokers=[XXX.83:8081, XXX.206:8081, XXX.87:8081, XXX.32:8081], bundle=public/default/0x9c000000_0xa0000000, hashcode=1364617948, index=0

2024-06-18T15:04:32,204+0800 [pulsar-2-19] DEBUG org.apache.pulsar.broker.loadbalance.impl.AvgShedder - expected broker:XXX.161:8081 is shutdown, candidates:[XXX.83:8081, XXX.32:8081, XXX.87:8081, XXX.206:8081]
2024-06-18T15:04:32,204+0800 [pulsar-2-19] DEBUG org.apache.pulsar.broker.loadbalance.impl.AvgShedder - Assignment details: brokers=[XXX.83:8081, XXX.206:8081, XXX.87:8081, XXX.32:8081], bundle=public/default/0x40000000_0x44000000, hashcode=425532458, index=2

2024-06-18T15:04:32,215+0800 [pulsar-2-11] DEBUG org.apache.pulsar.broker.loadbalance.impl.AvgShedder - expected broker:XXX.161:8081 is shutdown, candidates:[XXX.83:8081, XXX.32:8081, XXX.87:8081, XXX.206:8081]
2024-06-18T15:04:32,216+0800 [pulsar-2-11] DEBUG org.apache.pulsar.broker.loadbalance.impl.AvgShedder - Assignment details: brokers=[XXX.83:8081, XXX.206:8081, XXX.87:8081, XXX.32:8081], bundle=public/default/0x98000000_0x9c000000, hashcode=3472910095, index=3
```

It can be seen that the distribution result is already quite balanced: a total of 3 bundles were selected and assigned to 3 different brokers. However, the monitoring data shows that the traffic growth varies across brokers, suggesting that the traffic throughput of individual bundles differs. To optimize further, increasing the number of bundles in the cluster is a viable solution: this not only narrows the traffic gap between bundles but also allows some traffic to be allocated to the fourth broker.
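The hashcode and index fields in these logs hint at how the replacement owner is chosen: each logged index equals the bundle's hashcode modulo the number of candidate brokers (1364617948 % 4 = 0, 425532458 % 4 = 2, 3472910095 % 4 = 3). The sketch below illustrates that idea; the hash function (CRC32 here) and all names are assumptions for illustration, not the actual Pulsar code:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative sketch of hash-based bundle assignment; CRC32 merely stands
// in as an unsigned 32-bit hash like the values seen in the logs.
public class BundleAssignmentSketch {
    static String pickBroker(String bundle, List<String> candidates) {
        CRC32 crc = new CRC32();
        crc.update(bundle.getBytes(StandardCharsets.UTF_8));
        long hashcode = crc.getValue();                    // unsigned 32-bit hash
        int index = (int) (hashcode % candidates.size());  // index seen in the logs
        return candidates.get(index);                      // deterministic choice
    }

    public static void main(String[] args) {
        List<String> candidates =
                List.of("XXX.83:8081", "XXX.206:8081", "XXX.87:8081", "XXX.32:8081");
        System.out.println(pickBroker("public/default/0x9c000000_0xa0000000", candidates));
    }
}
```

Because the result is a pure function of the bundle name and the candidate list, every broker independently computes the same owner, so no extra coordination is needed when the expected owner is down.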

Since the resource utilization gap did not continuously exceed 15, no bundle unload was triggered after the scale-down.

5.2.2 Broker Scaling Up

After the cluster was scaled down and had stabilized, machine XXX.87 was added back to test how AvgShedder performs when scaling up.

It can be seen that a bundle unload was triggered shortly after the new machine was added. This is because the score difference between the highest-load and lowest-load brokers reached the high threshold of 40, so only two consecutive triggers were needed to perform a bundle unload (one check per minute, so only 2 minutes were needed instead of 8).
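For reference, both thresholds and their hit counts are configurable. Assuming the broker.conf property names proposed in PIP-364, which introduced AvgShedder (they should be verified against your Pulsar version), the values used in this test correspond to:

```properties
# Score-gap thresholds and required consecutive hit counts for AvgShedder.
# Property names follow PIP-364; confirm against your Pulsar version.
loadBalancerAvgShedderLowThreshold=15
loadBalancerAvgShedderHitCountLowThreshold=8
loadBalancerAvgShedderHighThreshold=40
loadBalancerAvgShedderHitCountHighThreshold=2
```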

After the bundle unload was triggered, the load was evenly distributed between the highest-load broker XXX.161 (blue line) and the newly added broker XXX.87 (orange line).
