2. ThresholdShedder + LeastResourceUsageWithWeight
A Pulsar cluster was built with 5 brokers and 30 bookies.
The load balancing-related configurations are as follows:
The cluster uses the combination of ThresholdShedder and LeastResourceUsageWithWeight. Most of the configuration is set to default values, and the bundle split and even distribution features are disabled.
Three pressure testing tasks were launched:
After the pressure testing tasks started, the cluster took 22 minutes to reach a stable state, triggering 8 rounds of bundle unload along the way.
To facilitate debugging, some logs were added. The log regarding the first bundle unload is presented as follows:
This log line prints the intermediate scores of all brokers (i.e., each broker's current maximum resource utilization, before the historical weight scoring algorithm is applied). Evidently, brokers XXX.83:8081, XXX.32:8081, and XXX.206:8081 are under high load, whereas the remaining two brokers are under low load.
This log line prints the final scores (i.e., the result of the historical scoring algorithm), average score, and threshold for all brokers.
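As a rough sketch of how these three values combine into an unload decision, a broker can be treated as overloaded when its final score exceeds the average score plus the threshold. The function name `is_overloaded` is hypothetical and is not taken from the Pulsar code base. The numbers are the ones reported for XXX.206:8081 in this round (score 28.95%, average 17.35%, threshold 10.0%).

```python
def is_overloaded(score: float, avg_score: float, threshold: float) -> bool:
    """Return True when a broker's final score exceeds average + threshold.

    All three values are percentages. This is an illustrative helper,
    not the actual ThresholdShedder implementation.
    """
    return score > avg_score + threshold


# Figures reported for XXX.206:8081 in the first round of bundle unload.
print(is_overloaded(28.95, 17.35, 10.0))  # True, since 28.95 > 27.35
```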
From the first two log lines, it can be seen that the load on each broker was low before the pressure testing tasks started. Because the final scores are weighted toward that history, the scores of all brokers at this point differed significantly from their actual load. Only the score of XXX.206:8081 exceeded the threshold: 28.95% > 17.35% + 10.0%. Consequently, a bundle unload operation was performed on it, giving rise to the following log:
After a bundle is unloaded, the placement policy LeastResourceUsageWithWeight is executed immediately:
As can be seen, the unloaded bundle was assigned to the high-load broker XXX.83:8081! This is a wrong load balancing decision. In this experiment, the probability of triggering this issue is extremely high, making it nearly inevitable. As depicted in the figure below, the problem was triggered four times in a row.
However, owing to the historical weight scoring algorithm, the scores of all brokers can only gradually approach their real load, starting from around 20, which makes it difficult for the gap between the scores of different brokers to widen. As a result, the LeastResourceUsageWithWeight algorithm can only perform random allocation.
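The slow convergence described above can be illustrated with a minimal sketch that models historical weight scoring as exponential smoothing. The formula, the history weight of 0.9, and the real loads of 80% and 40% are assumptions chosen for illustration. Only the starting score of around 20 comes from the experiment. The helper names `smooth` and `scores_over_time` are hypothetical.

```python
def smooth(previous_score: float, current_usage: float,
           history_weight: float = 0.9) -> float:
    """Blend the previous score with the current usage (assumed formula)."""
    return history_weight * previous_score + (1.0 - history_weight) * current_usage


def scores_over_time(start_score: float, real_load: float, rounds: int,
                     history_weight: float = 0.9) -> list:
    """Return the smoothed score after each round under a constant real load."""
    scores = []
    score = start_score
    for _ in range(rounds):
        score = smooth(score, real_load, history_weight)
        scores.append(score)
    return scores


# Two hypothetical brokers that both start at a score of 20.
high = scores_over_time(20.0, 80.0, rounds=5)  # real load 80%
low = scores_over_time(20.0, 40.0, rounds=5)   # real load 40%
for round_no, (h, l) in enumerate(zip(high, low), start=1):
    print(f"round {round_no}: high={h:.2f} low={l:.2f} gap={h - l:.2f}")
```

Under these assumptions, the two scores separate by only a few points per round even though the real loads differ by 40 points, which is why a placement rule that needs a large score gap rarely finds a candidate early on.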
To increase the load on each remaining broker, two brokers were shut down, and abnormal load balancing behavior was observed.
Three rounds of load balancing were performed:
In the first round, bundles were unloaded from the highest-load yellow machine XXX.206:8081 to the green machine XXX.83:8081. However, four bundle unload operations were performed in this round, causing the load on the yellow machine XXX.206:8081 to drop significantly, so that it quickly became the broker with the lowest load. This is the over unloading problem.
In the second round, 11 bundles were unloaded from the highest-load blue machine XXX.32:8081 and assigned to the high-load green machine XXX.83:8081 and the low-load machine XXX.206:8081. In this process, the over unloading problem occurred again, and the blue machine XXX.32:8081 became the broker with the lowest load. At the same time, there was also an over placement problem, as bundles were mistakenly assigned to the high-load green machine XXX.83:8081.
In the third round, the bundle was unloaded from the highest-load green machine XXX.83:8081 and reloaded to the blue machine XXX.32:8081. The cluster then entered a balanced state, with the entire process taking a total of 30 minutes.
With the help of broker logs, we can gain a deeper insight into the above process:
In the second round of bundle unload, bundles were unloaded from the highest-load broker, XXX.32:8081. However, an over placement problem was encountered, as these bundles were assigned to another high-load broker, XXX.83:8081.
Up to this point, we have experimentally verified the two core defects of ThresholdShedder + LeastResourceUsageWithWeight:
Over placement problem
Over unloading problem
Both issues are exacerbated by the historical weight scoring algorithm. It is also evident that load balancing with ThresholdShedder + LeastResourceUsageWithWeight is slow: because of incorrect load balancing decisions, the system often has to rebalance repeatedly before it eventually reaches a stable state.
Because every broker's score plus 10 is greater than the average score of 10.7%, the candidate broker list remains empty, which triggers random allocation. This is the issue described for LeastResourceUsageWithWeight: the candidate broker list can easily be empty, leading to random allocation.
The historical weight scoring algorithm also contributes to the high probability of occurrence. The criterion for choosing candidate brokers is that a broker's score must be 10 points lower than the average score, which implies a considerable gap among the scores of different brokers.
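The candidate selection rule described above, together with its random fallback, can be sketched as follows. This is a simplified illustration and not the actual LeastResourceUsageWithWeight code. The function names and the individual broker scores are hypothetical. The scores are chosen only so that their average is roughly the 10.7% mentioned above.

```python
import random
from statistics import mean


def select_candidates(scores: dict, diff_threshold: float = 10.0) -> list:
    """Return brokers whose score is at least diff_threshold below the average."""
    avg = mean(scores.values())
    return [broker for broker, score in scores.items()
            if score + diff_threshold <= avg]


def place_bundle(scores: dict, diff_threshold: float = 10.0, rng=random) -> str:
    """Pick a target broker, falling back to a random broker if no candidate exists."""
    candidates = select_candidates(scores, diff_threshold)
    if not candidates:
        # No broker is far enough below the average: random allocation.
        candidates = list(scores)
    return rng.choice(candidates)


# Hypothetical scores with an average of about 10.7%.
low_scores = {"broker-a": 12.0, "broker-b": 9.5, "broker-c": 10.6}
print(select_candidates(low_scores))  # [] because no score can be <= 0.7
print(place_bundle(low_scores))       # any of the three brokers
```

With an average this low, a broker would need a score of at most 0.7 to qualify, so the fallback branch is effectively the only path taken.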
It can be seen that the yellow machine frequently switches between high-load and low-load states. Its CPU utilization falls from 80% to 40% and then climbs back to 80%. In this process, the over unloading problem was triggered first, followed by the over placement problem.
Within the first 10 minutes of the first round of bundle unload, it can be observed that the system continuously identified XXX.206:8081 as a high-load broker and unloaded bundles from this node, eventually leading to the over unloading problem. This is due to the characteristics of the historical weight scoring algorithm, which causes the score of XXX.206:8081 to change very slowly. Even though the node has already unloaded bundles and its real load has changed correspondingly, the score remains high, so it is still identified as a high-load broker and continues to be unloaded.
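The lag between the real load and the score can be reproduced with a small simulation. Everything in it is an assumption made for illustration: the smoothing formula, the history weight of 0.9, a fixed average score of 50, a threshold of 10, and a drop of 10 points of real load per unload. The function name `simulate_over_unloading` is hypothetical.

```python
def simulate_over_unloading(start_load: float = 80.0, avg_score: float = 50.0,
                            threshold: float = 10.0, unload_step: float = 10.0,
                            history_weight: float = 0.9):
    """Keep unloading while the smoothed score exceeds avg_score + threshold.

    Returns the number of unload rounds and the broker's final real load.
    """
    load = start_load
    score = start_load  # assume the score has already caught up with the load
    rounds = 0
    while score > avg_score + threshold and load > 0:
        load -= unload_step  # the real load drops immediately
        # The score only moves a fraction of the way toward the new load.
        score = history_weight * score + (1.0 - history_weight) * load
        rounds += 1
    return rounds, load


rounds, final_load = simulate_over_unloading()
print(f"unload rounds: {rounds}, final real load: {final_load:.0f}%")
```

In this sketch the broker keeps being selected long after its real load has fallen below the average, so it ends up as the least loaded broker, which matches the over unloading behavior observed for XXX.206:8081.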