Compare commits

11 Commits

Author SHA1 Message Date
openeuler-ci-bot
1fdaf88326
!13 Update README and Makefile
From: @sundongx 
Reviewed-by: @kevinzhu1, @yezengruan 
Signed-off-by: @kevinzhu1
2022-12-09 07:27:16 +00:00
Dongxu Sun
94497b6f78 spec: Disable debug package for noarch
Currently, the debug package is enabled even for non-x86_64
(noarch) builds; disable it there.

Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
2022-12-09 11:40:21 +08:00
Dongxu Sun
c45c975504 docs/Makefile: Update README.md and Makefile
docs: Use Chinese description instead of English in README.md
Makefile: Use strip option in gcc

Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
2022-12-09 09:59:42 +08:00
openeuler-ci-bot
45997bd16c
!11 qos: Some bugfixes for power_qos/cachembw_qos/cpu_qos
From: @sundongx 
Reviewed-by: @kevinzhu1 
Signed-off-by: @kevinzhu1
2022-09-09 09:04:22 +00:00
sundongxu
e11c92af46 qos: Some bugfixes for power_qos/cachembw_qos/cpu_qos
cpu_qos: Register reset_domain_bandwidth as exit func
after adding power_qos job
power_qos/cachembw_qos: Add type check for environment
variables

Signed-off-by: sundongxu <sundongxu3@huawei.com>
2022-09-09 16:55:34 +08:00
openeuler-ci-bot
a0d0bdd6db
!9 qos: bugfix for create pidfile
From: @sundongx 
Reviewed-by: @kevinzhu1 
Signed-off-by: @kevinzhu1
2022-08-25 09:08:26 +00:00
sundongxu
bd4c9df0b7 qos: bugfix for create pidfile
Create pidfile after os.fork in child process.

Signed-off-by: sundongxu <sundongxu3@huawei.com>
2022-08-25 16:43:51 +08:00
openeuler-ci-bot
9565dd0238
!7 qos: More bugfixes for qos management
From: @kevinzhu1 
Reviewed-by: @yezengruan 
Signed-off-by: @yezengruan
2022-08-19 03:23:45 +00:00
Keqian Zhu
52abc18f1c qos: More bugfixes for qos management
Take another VM stop reason into account, add an additional setting
for cpu QOS and add a job to sync VM pids to resctrl.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
2022-08-12 18:28:14 +08:00
openeuler-ci-bot
4eb6fe52eb
!5 spec: Add missing dependencies of build and run
From: @kevinzhu1 
Reviewed-by: @yezengruan 
Signed-off-by: @yezengruan
2022-08-10 07:39:01 +00:00
Keqian Zhu
f60c625f4f spec: Add missing dependencies of build and run
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
2022-08-10 15:25:42 +08:00
9 changed files with 819 additions and 4 deletions

@@ -0,0 +1,32 @@
From d3d934e6358ba214bfbc81e12d6f38d287d2bc45 Mon Sep 17 00:00:00 2001
From: Dongxu Sun <sundongxu3@huawei.com>
Date: Thu, 8 Dec 2022 09:23:09 +0000
Subject: [PATCH 2/2] Makefile: Use strip option in gcc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add -s to the gcc compile options and remove -g, to harden
skylark (strip symbols) and reduce the binary file size.
Signed-off-by: sundongxu <sundongxu3@huawei.com>
---
Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Makefile b/Makefile
index cc00862..f2c0a80 100644
--- a/Makefile
+++ b/Makefile
@@ -7,7 +7,7 @@ skylarkd: *.py */*.py
mv ../skylarkd .
libskylarkmsr.so: data_collector/get_msr.c
- gcc --share -fPIC -g -o libskylarkmsr.so data_collector/get_msr.c
+ gcc --share -fPIC -s -o libskylarkmsr.so data_collector/get_msr.c
install: skylarkd libskylarkmsr.so skylarkd.service skylarkd.sysconfig low_prio_machine.slice high_prio_machine.slice
install -T -D skylarkd $(DESTDIR)/usr/sbin/skylarkd
--
2.27.0

@@ -0,0 +1,323 @@
From fe004fc62e8d47835c96902b26fc3f6e088559e6 Mon Sep 17 00:00:00 2001
From: Dongxu Sun <sundongxu3@huawei.com>
Date: Tue, 16 Aug 2022 10:46:47 +0800
Subject: [PATCH 3/3] cachembw_qos: Add a job to sync VM pids to resctrl
Delete unnecessary libvirt event registrations, and poll
the tasks files to sync VM pids to the resctrl tasks file.
Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
---
data_collector/datacollector.py | 7 --
data_collector/guestinfo.py | 7 --
qos_controller/cachembwcontroller.py | 22 ++---
skylark.py | 129 +++++----------------------
4 files changed, 28 insertions(+), 137 deletions(-)
diff --git a/data_collector/datacollector.py b/data_collector/datacollector.py
index aaddc42..8aa20ab 100644
--- a/data_collector/datacollector.py
+++ b/data_collector/datacollector.py
@@ -31,13 +31,6 @@ class DataCollector:
def set_static_power_info(self):
self.host_info.set_host_power_attribute()
- def reset_base_info(self, vir_conn):
- self.guest_info.clear_guest_info()
- self.guest_info.update_guest_info(vir_conn, self.host_info.host_topo)
-
- def reset_power_info(self):
- self.host_info.update_host_power_info()
-
def update_base_info(self, vir_conn):
self.guest_info.update_guest_info(vir_conn, self.host_info.host_topo)
diff --git a/data_collector/guestinfo.py b/data_collector/guestinfo.py
index 38fa827..97d9f20 100644
--- a/data_collector/guestinfo.py
+++ b/data_collector/guestinfo.py
@@ -125,13 +125,6 @@ class GuestInfo:
self.domain_online = []
self.running_domain_in_cpus = []
- def clear_guest_info(self):
- self.vm_dict.clear()
- self.low_prio_vm_dict.clear()
- self.vm_online_dict.clear()
- self.domain_online.clear()
- self.running_domain_in_cpus.clear()
-
def update_guest_info(self, conn, host_topo):
self.running_domain_in_cpus.clear()
self.vm_online_dict.clear()
diff --git a/qos_controller/cachembwcontroller.py b/qos_controller/cachembwcontroller.py
index 8be9329..bbfe08f 100644
--- a/qos_controller/cachembwcontroller.py
+++ b/qos_controller/cachembwcontroller.py
@@ -25,7 +25,6 @@ from data_collector.guestinfo import GuestInfo
from data_collector.hostinfo import ResctrlInfo
LOW_VMS_RESGROUP_PATH = "/sys/fs/resctrl/low_prio_machine"
-LOW_VMS_PID_CGRP_PATH = "/sys/fs/cgroup/pids/low_prio_machine.slice"
LOW_MBW_INIT_FLOOR = 0.1
LOW_MBW_INIT_CEIL = 0.2
LOW_CACHE_INIT_FLOOR = 1
@@ -56,11 +55,6 @@ class CacheMBWController:
ResgroupFileOperations.create_group_dir(LOW_VMS_RESGROUP_PATH)
self.__get_low_init_alloc(resctrl_info)
self.set_low_init_alloc(resctrl_info)
- with os.scandir(LOW_VMS_PID_CGRP_PATH) as it:
- for entry in it:
- if entry.is_file():
- continue
- self.__add_vm_pids(entry.name)
def __get_low_init_alloc(self, resctrl_info: ResctrlInfo):
low_vms_mbw_init = float(os.getenv("MIN_MBW_LOW_VMS"))
@@ -111,25 +105,21 @@ class CacheMBWController:
util.file_write(os.path.join(
LOW_VMS_RESGROUP_PATH, "schemata"), schemata_mbw_alloc)
- def domain_updated(self, domain, guest_info: GuestInfo):
- if domain.ID() in guest_info.low_prio_vm_dict:
- self.__add_vm_pids(guest_info.vm_dict[domain.ID()].cgroup_name)
-
@staticmethod
- def __add_vm_pids(vm_cgrp_name):
- tasks_path = os.path.join(LOW_VMS_PID_CGRP_PATH, vm_cgrp_name, "tasks")
+ def add_vm_pids(tasks_path):
if not os.access(tasks_path, os.R_OK):
LOGGER.warning(
"The path %s is not readable, please check." % tasks_path)
return
- LOGGER.info("Add %s pids to %s" %
- (vm_cgrp_name, os.path.join(LOW_VMS_RESGROUP_PATH, "tasks")))
+
+ resctrl_tsk_path = os.path.join(LOW_VMS_RESGROUP_PATH, "tasks")
+ LOGGER.debug("Add %s pids to %s" % (tasks_path, resctrl_tsk_path))
try:
with open(tasks_path) as tasks:
for task in tasks:
- util.file_write(os.path.join(LOW_VMS_RESGROUP_PATH, "tasks"), task)
+ util.file_write(resctrl_tsk_path, task)
except IOError as e:
- LOGGER.error("Failed to add VM(%s) pids to resctrl: %s" % (vm_cgrp_name, str(e)))
+ LOGGER.error("Failed to add %s pids to resctrl: %s" % (tasks_path, str(e)))
# If the VM doesn't stop, raise exception.
if os.access(tasks_path, os.F_OK):
raise
diff --git a/skylark.py b/skylark.py
index 8962ba5..c281a54 100644
--- a/skylark.py
+++ b/skylark.py
@@ -27,7 +27,7 @@ import stat
import subprocess
import sys
-from apscheduler.schedulers.background import BackgroundScheduler
+from apscheduler.schedulers.background import BlockingScheduler
from apscheduler.events import EVENT_JOB_ERROR
import libvirt
@@ -47,12 +47,7 @@ LIBVIRT_DRIVE_TYPE = None
PID_FILE = None
MSR_PATH = "/dev/cpu/0/msr"
PID_FILE_NAME = "/var/run/skylarkd.pid"
-
-STATE_TO_STRING = ['VIR_DOMAIN_EVENT_DEFINED', 'VIR_DOMAIN_EVENT_UNDEFINED',
- 'VIR_DOMAIN_EVENT_STARTED', 'VIR_DOMAIN_EVENT_SUSPENDED',
- 'VIR_DOMAIN_EVENT_RESUMED', 'VIR_DOMAIN_EVENT_STOPPED',
- 'VIR_DOMAIN_EVENT_SHUTDOWN', 'VIR_DOMAIN_EVENT_PMSUSPENDED',
- 'VIR_DOMAIN_EVENT_CRASHED', 'VIR_DOMAIN_EVENT_LAST']
+LOW_VMS_PID_CGRP_PATH = "/sys/fs/cgroup/pids/low_prio_machine.slice"
class QosManager:
@@ -63,17 +58,18 @@ class QosManager:
self.cpu_controller = CpuController()
self.net_controller = NetController()
self.cachembw_controller = CacheMBWController()
- self.has_job = False
def scheduler_listener(self, event):
if event.exception:
+ LOGGER.info("The Scheduler detects an exception, send SIGABRT and restart skylark...")
self.scheduler.remove_all_jobs()
+ os.kill(os.getpid(), signal.SIGABRT)
def init_scheduler(self):
- self.scheduler = BackgroundScheduler(logger=LOGGER)
+ self.scheduler = BlockingScheduler(logger=LOGGER)
if os.getenv("POWER_QOS_MANAGEMENT", "false").lower() == "true":
self.scheduler.add_job(self.__do_power_manage, trigger='interval', seconds=1, id='do_power_manage')
- self.has_job = True
+ self.scheduler.add_job(self.__do_resctrl_sync, trigger='interval', seconds=0.5, id='do_resctrl_sync')
self.scheduler.add_listener(self.scheduler_listener, EVENT_JOB_ERROR)
def init_data_collector(self):
@@ -95,19 +91,19 @@ class QosManager:
def start_scheduler(self):
self.scheduler.start()
- def reset_data_collector(self):
- self.scheduler.shutdown(wait=True)
- self.data_collector.reset_base_info(self.vir_conn)
- if os.getenv("POWER_QOS_MANAGEMENT", "false").lower() == "true":
- self.data_collector.reset_power_info()
- self.init_scheduler()
- self.start_scheduler()
-
def __do_power_manage(self):
self.data_collector.update_base_info(self.vir_conn)
self.data_collector.update_power_info()
self.power_analyzer.power_manage(self.data_collector, self.cpu_controller)
+ def __do_resctrl_sync(self):
+ with os.scandir(LOW_VMS_PID_CGRP_PATH) as it:
+ for entry in it:
+ if entry.is_file():
+ continue
+ tasks_path = os.path.join(LOW_VMS_PID_CGRP_PATH, entry.name, "tasks")
+ self.cachembw_controller.add_vm_pids(tasks_path)
+
def create_pid_file():
global PID_FILE
@@ -140,85 +136,27 @@ def remove_pid_file():
util.remove_file(PID_FILE.name)
-def register_callback_event(conn, event_id, callback_func, opaque):
- if conn is not None and event_id >= 0:
- try:
- return conn.domainEventRegisterAny(None, event_id, callback_func, opaque)
- except libvirt.libvirtError as error:
- LOGGER.warning("Register event exception %s" % str(error))
- return -1
-
-
-def deregister_callback_event(conn, callback_id):
- if conn is not None and callback_id >= 0:
- try:
- conn.domainEventDeregisterAny(callback_id)
- except libvirt.libvirtError as error:
- LOGGER.warning("Deregister event exception %s" % str(error))
-
-
-def event_lifecycle_callback(conn, dom, event, detail, opaque):
- LOGGER.info("Occur lifecycle event: domain %s(%d) state changed to %s" % (
- dom.name(), dom.ID(), STATE_TO_STRING[event]))
- vm_started = (event == libvirt.VIR_DOMAIN_EVENT_STARTED)
- vm_stopped = (event == libvirt.VIR_DOMAIN_EVENT_STOPPED)
- if vm_started or vm_stopped:
- QOS_MANAGER_ENTRY.reset_data_collector()
- if vm_started:
- QOS_MANAGER_ENTRY.cachembw_controller.domain_updated(dom,
- QOS_MANAGER_ENTRY.data_collector.guest_info)
- return 0
-
-
-def event_device_added_callback(conn, dom, dev_alias, opaque):
- device_name = str(dev_alias[0:4])
- if device_name == "vcpu":
- LOGGER.info("Occur device added event: domain %s(%d) add vcpu" % (dom.name(), dom.ID()))
- QOS_MANAGER_ENTRY.reset_data_collector()
- QOS_MANAGER_ENTRY.cachembw_controller.domain_updated(dom,
- QOS_MANAGER_ENTRY.data_collector.guest_info)
-
-
-def event_device_removed_callback(conn, dom, dev_alias, opaque):
- device_name = str(dev_alias[0:4])
- if device_name == "vcpu":
- LOGGER.info("Occur device removed event: domain %s(%d) removed vcpu" % (dom.name(), dom.ID()))
- QOS_MANAGER_ENTRY.reset_data_collector()
-
-
def sigterm_handler(signo, stack):
sys.exit(0)
+def sigabrt_handler(signo, stack):
+ sys.exit(1)
+
def func_daemon():
global LIBVIRT_CONN
global QOS_MANAGER_ENTRY
- event_lifecycle_id = -1
- event_device_added_id = -1
- event_device_removed_id = -1
-
signal.signal(signal.SIGTERM, sigterm_handler)
- signal.signal(signal.SIGABRT, sigterm_handler)
+ signal.signal(signal.SIGABRT, sigabrt_handler)
@atexit.register
def daemon_exit_func():
- deregister_callback_event(LIBVIRT_CONN, event_lifecycle_id)
- deregister_callback_event(LIBVIRT_CONN, event_device_added_id)
- deregister_callback_event(LIBVIRT_CONN, event_device_removed_id)
LIBVIRT_CONN.close()
remove_pid_file()
create_pid_file()
- try:
- if libvirt.virEventRegisterDefaultImpl() < 0:
- LOGGER.error("Failed to register event implementation!")
- sys.exit(1)
- except libvirt.libvirtError:
- LOGGER.error("System internal error!")
- sys.exit(1)
-
LOGGER.info("Try to open libvirtd connection")
try:
LIBVIRT_CONN = libvirt.open(LIBVIRT_URI)
@@ -227,40 +165,17 @@ def func_daemon():
LOGGER.error("System internal error, failed to open libvirtd connection!")
sys.exit(1)
- event_lifecycle_id = register_callback_event(LIBVIRT_CONN,
- libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
- event_lifecycle_callback, None)
- event_device_added_id = register_callback_event(LIBVIRT_CONN,
- libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_ADDED,
- event_device_added_callback, None)
- event_device_removed_id = register_callback_event(LIBVIRT_CONN,
- libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED,
- event_device_removed_callback, None)
- if event_lifecycle_id < 0 or event_device_added_id < 0 or event_device_removed_id < 0:
- LOGGER.error("Failed to register libvirt event %d, %d, %d"
- % (event_lifecycle_id, event_device_added_id, event_device_removed_id))
- sys.exit(1)
- LOGGER.info("Libvirtd connected and libvirt event registered.")
+ LOGGER.info("Libvirtd connected.")
QOS_MANAGER_ENTRY = QosManager(LIBVIRT_CONN)
QOS_MANAGER_ENTRY.init_scheduler()
QOS_MANAGER_ENTRY.init_data_collector()
QOS_MANAGER_ENTRY.init_qos_analyzer()
QOS_MANAGER_ENTRY.init_qos_controller()
+
+ LOGGER.info("QoS management ready to start.")
QOS_MANAGER_ENTRY.start_scheduler()
- LOGGER.info("QoS management started.")
-
- while LIBVIRT_CONN.isAlive():
- if not QOS_MANAGER_ENTRY.scheduler.get_jobs() and QOS_MANAGER_ENTRY.has_job:
- LOGGER.error("The Scheduler detects an exception, process will exit!")
- break
- try:
- if libvirt.virEventRunDefaultImpl() < 0:
- LOGGER.warning("Failed to run event loop")
- break
- except libvirt.libvirtError as error:
- LOGGER.warning("Run libvirt event loop failed: %s" % str(error))
- break
+
sys.exit(1)
--
2.33.0
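The polling job added above (`__do_resctrl_sync` plus the reworked `add_vm_pids`) can be sketched as a small stand-alone function. The directory layout below is a hypothetical stand-in for `/sys/fs/cgroup/pids/low_prio_machine.slice` and the resctrl tasks file, and unlike the real resctrl interface (which expects one PID per `write()`), the sketch simply appends lines to an ordinary file.

```python
import os

def sync_pids_to_resctrl(cgrp_root, resctrl_tasks):
    """Mirror each VM sub-cgroup's tasks file under cgrp_root into
    the resctrl tasks file, like skylark's __do_resctrl_sync()."""
    with os.scandir(cgrp_root) as it:
        for entry in it:
            if entry.is_file():              # VM cgroups are directories
                continue
            tasks_path = os.path.join(cgrp_root, entry.name, "tasks")
            if not os.access(tasks_path, os.R_OK):
                continue                     # the VM may have just stopped
            with open(tasks_path) as tasks, open(resctrl_tasks, "a") as out:
                for task in tasks:           # one PID per line
                    out.write(task)
```

In the real daemon this runs every 0.5 s from the scheduler, and a write error is only fatal while the VM's cgroup still exists.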

@@ -0,0 +1,43 @@
From 14b8a140c5794b28ccbb1c396924a9363767b7ea Mon Sep 17 00:00:00 2001
From: Keqian Zhu <zhukeqian1@huawei.com>
Date: Thu, 4 Aug 2022 15:34:59 -0400
Subject: [PATCH 2/3] cpu_qos: Add additional setting for cpu QOS
Set overload_detect_period and offline_wait_interval, so as
to prevent low-priority VMs from being starved for too long.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
qos_controller/cpucontroller.py | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/qos_controller/cpucontroller.py b/qos_controller/cpucontroller.py
index f857973..f2a67e0 100644
--- a/qos_controller/cpucontroller.py
+++ b/qos_controller/cpucontroller.py
@@ -23,6 +23,10 @@ import util
LOW_PRIORITY_SLICES_PATH = "/sys/fs/cgroup/cpu/low_prio_machine.slice"
LOW_PRIORITY_QOS_LEVEL = -1
+OVERLOAG_DETECT_PERIOD_PATH = "/proc/sys/kernel/qos_overload_detect_period_ms"
+OVERLOAG_DETECT_PERIOD_MS = 100
+OFFLINE_WAIT_INTERVAL_PATH = "/proc/sys/kernel/qos_offline_wait_interval_ms"
+OFFLINE_WAIT_INTERVAL_MS = 100
MIN_QUOTA_US = 0
@@ -36,8 +40,10 @@ class CpuController:
qos_level_path = os.path.join(LOW_PRIORITY_SLICES_PATH, "cpu.qos_level")
try:
util.file_write(qos_level_path, str(LOW_PRIORITY_QOS_LEVEL))
+ util.file_write(OVERLOAG_DETECT_PERIOD_PATH, str(OVERLOAG_DETECT_PERIOD_MS))
+ util.file_write(OFFLINE_WAIT_INTERVAL_PATH, str(OFFLINE_WAIT_INTERVAL_MS))
except IOError as error:
- LOGGER.error("Failed to set low priority cpu qos level: %s" % str(error))
+ LOGGER.error("Failed to configure CPU QoS for low priority VMs: %s" % str(error))
raise
def limit_domain_bandwidth(self, guest_info, quota_threshold, abnormal_threshold):
--
2.33.0
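Functionally, the new settings are three plain procfs/cgroupfs writes. A minimal sketch of the pattern follows; `file_write` mirrors skylark's `util.file_write`, and the three paths are parameters here so the sketch does not assume the QoS-enabled openEuler kernel.

```python
def file_write(file_path, value):
    """Truncate file_path and write value (like util.file_write)."""
    with open(file_path, 'w') as f:
        f.write(value)

def set_low_priority_cpu_qos(qos_level_path, detect_period_path,
                             wait_interval_path):
    """Apply the three writes from CpuController.set_low_priority_cgroup();
    an IOError from any of them should be treated as fatal by the caller."""
    file_write(qos_level_path, "-1")       # LOW_PRIORITY_QOS_LEVEL
    file_write(detect_period_path, "100")  # OVERLOAG_DETECT_PERIOD_MS
    file_write(wait_interval_path, "100")  # OFFLINE_WAIT_INTERVAL_MS
```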

@@ -0,0 +1,113 @@
From a165c7131e09749401b01b3a7d568e96a9ca8b3a Mon Sep 17 00:00:00 2001
From: Dongxu Sun <sundongxu3@huawei.com>
Date: Sat, 3 Sep 2022 15:02:47 +0800
Subject: [PATCH 1/2] cpu_qos: Register reset_domain_bandwidth as exit func
after adding power_qos job
Currently, the domain bandwidth can be changed by
skylark only in the power_qos job, so reset_domain_bandwidth
should be registered after adding the power_qos job.
Besides, there is no need to reset domain bandwidth
when the domain cgroup path does not exist, since the
domain may have been stopped.
Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
---
qos_controller/cpucontroller.py | 21 ++++++++++-----------
skylark.py | 3 ++-
util.py | 5 +++--
3 files changed, 15 insertions(+), 14 deletions(-)
diff --git a/qos_controller/cpucontroller.py b/qos_controller/cpucontroller.py
index f2a67e0..26b1240 100644
--- a/qos_controller/cpucontroller.py
+++ b/qos_controller/cpucontroller.py
@@ -63,12 +63,12 @@ class CpuController:
quota_path = os.path.join(vm_slices_path, domain.cgroup_name, "cpu.cfs_quota_us")
try:
- util.file_write(quota_path, str(domain_quota_us))
+ util.file_write(quota_path, str(domain_quota_us), log=False)
except IOError as error:
- LOGGER.error("Failed to limit domain %s(%d) cpu bandwidth: %s"
- % (domain.domain_name, domain.domain_id, str(error)))
# If VM doesn't stop, raise exception.
if os.access(quota_path, os.F_OK):
+ LOGGER.error("Failed to limit domain %s(%d) cpu bandwidth: %s"
+ % (domain.domain_name, domain.domain_id, str(error)))
raise
else:
LOGGER.info("Domain %s(%d) cpu bandwidth was limitted to %s"
@@ -83,12 +83,12 @@ class CpuController:
quota_path = os.path.join(vm_slices_path, domain.cgroup_name, "cpu.cfs_quota_us")
try:
- util.file_write(quota_path, str(initial_bandwidth))
+ util.file_write(quota_path, str(initial_bandwidth), log=False)
except IOError as error:
- LOGGER.error("Failed to recovery domain %s(%d) cpu bandwidth: %s!"
- % (domain.domain_name, domain.domain_id, str(error)))
# If VM doesn't stop, raise exception.
if os.access(quota_path, os.F_OK):
+ LOGGER.error("Failed to recovery domain %s(%d) cpu bandwidth: %s!"
+ % (domain.domain_name, domain.domain_id, str(error)))
raise
else:
LOGGER.info("Domain %s(%d) cpu bandwidth was recoveried to %s"
@@ -101,13 +101,12 @@ class CpuController:
domain = guest_info.low_prio_vm_dict.get(domain_id)
initial_bandwidth = domain.global_quota_config
quota_path = os.path.join(vm_slices_path, domain.cgroup_name, "cpu.cfs_quota_us")
-
try:
- util.file_write(quota_path, str(initial_bandwidth))
+ util.file_write(quota_path, str(initial_bandwidth), log=False)
except IOError:
- LOGGER.error("Failed to reset domain %s(%d) cpu bandwidth to its initial bandwidth %s!"
- % (domain.domain_name, domain.domain_id, initial_bandwidth))
- # This is on exiting path, make no sense to raise exception.
+ if os.access(quota_path, os.F_OK):
+ LOGGER.error("Failed to reset domain %s(%d) cpu bandwidth to its initial bandwidth %s!"
+ % (domain.domain_name, domain.domain_id, initial_bandwidth))
else:
LOGGER.info("Domain %s(%d) cpu bandwidth was reset to %s"
% (domain.domain_name, domain.domain_id, initial_bandwidth))
diff --git a/skylark.py b/skylark.py
index 6224f9b..2ec9862 100644
--- a/skylark.py
+++ b/skylark.py
@@ -84,8 +84,9 @@ class QosManager:
def init_qos_controller(self):
self.cpu_controller.set_low_priority_cgroup()
+ if os.getenv("POWER_QOS_MANAGEMENT", "false").lower() == "true":
+ atexit.register(self.cpu_controller.reset_domain_bandwidth, self.data_collector.guest_info)
self.cachembw_controller.init_cachembw_controller(self.data_collector.host_info.resctrl_info)
- atexit.register(self.cpu_controller.reset_domain_bandwidth, self.data_collector.guest_info)
self.net_controller.init_net_controller()
def start_scheduler(self):
diff --git a/util.py b/util.py
index 70f6f5a..2b8c3db 100644
--- a/util.py
+++ b/util.py
@@ -31,13 +31,14 @@ def file_read(file_path):
raise
-def file_write(file_path, value):
+def file_write(file_path, value, log=True):
try:
with open(file_path, 'wb') as file:
file.truncate()
file.write(str.encode(value))
except FileNotFoundError as error:
- LOGGER.error(str(error))
+ if log:
+ LOGGER.error(str(error))
raise
--
2.17.1
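The core of the fix is that the exit hook is now registered only when power QoS management is enabled. A small sketch of that conditional `atexit` registration follows; `register_reset_hook` is a hypothetical helper (not skylark API), with the registration function injectable so it can be exercised without exiting the interpreter.

```python
import atexit
import os

def register_reset_hook(reset_func, guest_info, register=atexit.register):
    """Register reset_func(guest_info) as an exit handler only when the
    power_qos job will exist, mirroring the reordered init_qos_controller()."""
    if os.getenv("POWER_QOS_MANAGEMENT", "false").lower() == "true":
        register(reset_func, guest_info)
        return True
    return False
```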

@@ -0,0 +1,29 @@
From 2552ff970feaddc6deda6c83298f75eae59bf6ec Mon Sep 17 00:00:00 2001
From: Dongxu Sun <sundongxu3@huawei.com>
Date: Wed, 7 Dec 2022 09:04:22 +0000
Subject: [PATCH 1/2] docs: Use Chinese description instead of English in
 README.md
Use a Chinese description in README.md.
Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index d51b8fd..9bdcc1c 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
# skylark
#### 介绍
-Skylark is a next-generation QoS-aware scheduler which provides coordinated resource scheduling for co-located applications with different QoS requirements. Typical applications are VM and Container. The architecture is highly scalable, so it's easy to be extended to support new types of applications and resources in the future.
+Skylark 是新一代 QoS 感知的资源调度器,可为不同 QoS 要求的混部业务提供合适的资源调度。典型的业务包含虚拟机和容器。本组件具有丰富的可扩展性,因此易于支持未来可能出现的新型业务和资源。
#### 软件架构
--
2.27.0

@@ -0,0 +1,62 @@
From 12d0dd3662a21acdde8e2f0264ed4c8aec6e3138 Mon Sep 17 00:00:00 2001
From: Dongxu Sun <sundongxu3@huawei.com>
Date: Wed, 24 Aug 2022 15:47:09 +0800
Subject: [PATCH] framework: create pidfile after os.fork in child process
To prevent systemd from reading the pidfile before Skylark
has initialized it, move create_pid_file() after os.fork, into
the child process.
Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
---
skylark.py | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/skylark.py b/skylark.py
index c281a54..6224f9b 100644
--- a/skylark.py
+++ b/skylark.py
@@ -112,15 +112,16 @@ def create_pid_file():
os.fchmod(fd, stat.S_IRUSR | stat.S_IWUSR)
os.close(fd)
try:
- PID_FILE = open(PID_FILE_NAME, 'w')
+ PID_FILE = open(PID_FILE_NAME, 'a')
except IOError:
LOGGER.error("Failed to open pid file")
- exit(1)
+ sys.exit(1)
try:
fcntl.flock(PID_FILE.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
LOGGER.error("A running service instance already creates the pid file! This service will exit!")
+ PID_FILE.close()
os._exit(0)
process_pid = os.getpid()
@@ -153,9 +154,6 @@ def func_daemon():
@atexit.register
def daemon_exit_func():
LIBVIRT_CONN.close()
- remove_pid_file()
-
- create_pid_file()
LOGGER.info("Try to open libvirtd connection")
try:
@@ -186,7 +184,10 @@ def create_daemon():
LOGGER.error('Fork daemon process failed: %d (%s)' % (error.errno, error.strerror))
os._exit(1)
else:
- if pid:
+ if pid == 0:
+ atexit.register(remove_pid_file)
+ create_pid_file()
+ else:
os._exit(0)
os.chdir('/')
os.umask(0)
--
2.17.1
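The pidfile logic itself is an exclusive `flock` plus a PID write. A stand-alone sketch follows; `create_pid_file` here is a simplified stand-in for skylark's version, and in the daemon it must run in the child process, after `os.fork`, so systemd never sees a pidfile that is not yet initialized.

```python
import fcntl
import os
import stat

def create_pid_file(path):
    """Create and lock a pidfile; returns the open file object (which must
    stay open to hold the lock), or None when another instance already
    holds the lock on that file."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    os.fchmod(fd, stat.S_IRUSR | stat.S_IWUSR)   # 0600, owner only
    pid_file = os.fdopen(fd, 'r+')
    try:
        fcntl.flock(pid_file.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        pid_file.close()                         # someone else owns it
        return None
    pid_file.truncate(0)
    pid_file.write(str(os.getpid()))
    pid_file.flush()
    return pid_file
```

Note that `flock` locks belong to the open file description, so a second `open()` of the same path, even in the same process, is refused while the first handle is alive.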

@@ -0,0 +1,96 @@
From bdd805eec082062e042acda6caf38ca17dbaec50 Mon Sep 17 00:00:00 2001
From: Keqian Zhu <zhukeqian1@huawei.com>
Date: Thu, 4 Aug 2022 14:46:33 -0400
Subject: [PATCH 1/3] guestinfo: Take another VM stop reason into account
When a VM is closed by OpenStack, the exception code is not
VIR_ERR_NO_DOMAIN, but the exception message contains "domain
is not running".
Also refactor the code to make it more readable.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
---
data_collector/guestinfo.py | 48 +++++++++++++++++--------------------
1 file changed, 22 insertions(+), 26 deletions(-)
diff --git a/data_collector/guestinfo.py b/data_collector/guestinfo.py
index 415b3f6..38fa827 100644
--- a/data_collector/guestinfo.py
+++ b/data_collector/guestinfo.py
@@ -28,6 +28,7 @@ DEFAULT_PRIORITY = "machine"
HIGH_PRIORITY = "high_prio_machine"
LOW_PRIORITY = "low_prio_machine"
PIDS_CGRP_PATH = "/sys/fs/cgroup/pids"
+DOMAIN_STOP_MSG = "domain is not running"
class DomainInfo:
@@ -141,43 +142,38 @@ class GuestInfo:
self.running_domain_in_cpus.append([])
self.get_all_active_domain(conn)
-
for dom in self.domain_online:
self.vm_online_dict[dom.ID()] = dom
+ # Remove ever see but now stopped domains
+ for vm_id in list(self.vm_dict):
+ if vm_id not in self.vm_online_dict:
+ del self.vm_dict[vm_id]
+
for vm_id in self.vm_online_dict:
- ret = -1
- if vm_id in self.vm_dict:
- try:
+ try:
+ if vm_id in self.vm_dict:
ret = self.vm_dict.get(vm_id).update_domain_info(self.vm_online_dict.get(vm_id), host_topo)
- except libvirt.libvirtError as e:
- if e.get_error_code() != libvirt.VIR_ERR_NO_DOMAIN:
- raise
- if ret < 0:
- del self.vm_dict[vm_id]
- continue
- else:
- try:
- vm_info = DomainInfo()
- ret = vm_info.set_domain_attribute(self.vm_online_dict.get(vm_id), host_topo)
- except libvirt.libvirtError as e:
- if e.get_error_code() != libvirt.VIR_ERR_NO_DOMAIN:
- raise
- if ret < 0:
- continue
- self.vm_dict[vm_id] = vm_info
+ else:
+ self.vm_dict[vm_id] = DomainInfo()
+ ret = self.vm_dict.get(vm_id).set_domain_attribute(self.vm_online_dict.get(vm_id), host_topo)
+ except libvirt.libvirtError as e:
+ ret = -1
+ # If domain doesn't stop, raise exception
+ if e.get_error_code() != libvirt.VIR_ERR_NO_DOMAIN and \
+ DOMAIN_STOP_MSG not in e.get_error_message():
+ raise
+ if ret < 0:
+ del self.vm_dict[vm_id]
+ continue
+ if self.vm_dict.get(vm_id).priority == 1:
+ self.low_prio_vm_dict[vm_id] = self.vm_dict.get(vm_id)
for cpu in range(host_topo.max_cpu_nums):
self.running_domain_in_cpus[cpu].append((self.vm_dict.get(vm_id).cpu_usage[cpu],
self.vm_dict.get(vm_id).domain_id,
self.vm_dict.get(vm_id).domain_name,
self.vm_dict.get(vm_id).priority))
- for vm_id in list(self.vm_dict):
- if vm_id not in self.vm_online_dict:
- del self.vm_dict[vm_id]
- elif vm_id not in self.low_prio_vm_dict and self.vm_dict.get(vm_id).priority == 1:
- self.low_prio_vm_dict[vm_id] = self.vm_dict.get(vm_id)
-
def get_all_active_domain(self, conn):
try:
self.domain_online = conn.listAllDomains(flags=libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE)
--
2.33.0
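The tolerated-error test introduced above (error code `VIR_ERR_NO_DOMAIN`, or a message containing "domain is not running") can be isolated into a predicate. Since libvirt may not be importable, the sketch below uses a minimal stand-in error class, and the `VIR_ERR_NO_DOMAIN` value is hard-coded for illustration.

```python
VIR_ERR_NO_DOMAIN = 42          # stand-in for libvirt.VIR_ERR_NO_DOMAIN
DOMAIN_STOP_MSG = "domain is not running"

class FakeLibvirtError(Exception):
    """Minimal stand-in for libvirt.libvirtError in this sketch."""
    def __init__(self, code, message):
        super().__init__(message)
        self._code, self._msg = code, message
    def get_error_code(self):
        return self._code
    def get_error_message(self):
        return self._msg

def domain_has_stopped(err):
    """True when the error just means the VM went away (the benign
    case the patch tolerates), False for a real failure to re-raise."""
    return (err.get_error_code() == VIR_ERR_NO_DOMAIN
            or DOMAIN_STOP_MSG in err.get_error_message())
```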

@@ -0,0 +1,82 @@
From 931b1d3767f6c62639d46cc51f9a831cba112de3 Mon Sep 17 00:00:00 2001
From: Dongxu Sun <sundongxu3@huawei.com>
Date: Sat, 3 Sep 2022 16:39:51 +0800
Subject: [PATCH 2/2] power_qos/cachembw_qos: Add type check for environment
variables
Add type check for environment variables.
Signed-off-by: Dongxu Sun <sundongxu3@huawei.com>
---
qos_analyzer/poweranalyzer.py | 16 ++++++++++------
qos_controller/cachembwcontroller.py | 9 +++++++--
2 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/qos_analyzer/poweranalyzer.py b/qos_analyzer/poweranalyzer.py
index 23f6369..04fe51c 100644
--- a/qos_analyzer/poweranalyzer.py
+++ b/qos_analyzer/poweranalyzer.py
@@ -17,6 +17,7 @@ Description: This file is used for providing a power analyzer
# @code
import os
+import sys
from logger import LOGGER
from qos_controller import cpucontroller
@@ -34,13 +35,16 @@ class PowerAnalyzer:
self.qos_controller = cpucontroller.CpuController()
def set_hotspot_threshold(self, data_collector):
- self.tdp_threshold = float(os.getenv("TDP_THRESHOLD"))
- self.freq_threshold = float(os.getenv("FREQ_THRESHOLD"))
- self.abnormal_threshold = int(os.getenv("ABNORMAL_THRESHOLD"))
- self.quota_threshold = float(os.getenv("QUOTA_THRESHOLD"))
+ try:
+ self.tdp_threshold = float(os.getenv("TDP_THRESHOLD", "0.98"))
+ self.freq_threshold = float(os.getenv("FREQ_THRESHOLD", "0.98"))
+ self.abnormal_threshold = int(os.getenv("ABNORMAL_THRESHOLD", "3"))
+ self.quota_threshold = float(os.getenv("QUOTA_THRESHOLD", "0.9"))
+ except ValueError:
+ LOGGER.error("Threshold parameter type is incorrect, please check.")
+ sys.exit(1)
self.__check_threshold_validity()
-
- self.freq_threshold = float(os.getenv("FREQ_THRESHOLD")) * data_collector.host_info.cpu_turbofreq_mhz
+ self.freq_threshold = self.freq_threshold * data_collector.host_info.cpu_turbofreq_mhz
LOGGER.info("Frequency threshold is %.2f, abnormal times threshold is %d, bandwidth threshold is %.2f"
% (self.freq_threshold, self.abnormal_threshold, self.quota_threshold))
diff --git a/qos_controller/cachembwcontroller.py b/qos_controller/cachembwcontroller.py
index bbfe08f..a56ca59 100644
--- a/qos_controller/cachembwcontroller.py
+++ b/qos_controller/cachembwcontroller.py
@@ -17,6 +17,7 @@ Description: This file is used for control CACHE/MBW of low priority vms
# @code
import os
+import sys
import errno
import util
@@ -57,11 +58,15 @@ class CacheMBWController:
self.set_low_init_alloc(resctrl_info)
def __get_low_init_alloc(self, resctrl_info: ResctrlInfo):
- low_vms_mbw_init = float(os.getenv("MIN_MBW_LOW_VMS"))
+ try:
+ low_vms_mbw_init = float(os.getenv("MIN_MBW_LOW_VMS", "0.1"))
+ low_vms_cache_init = int(os.getenv("MIN_LLC_WAYS_LOW_VMS", "2"))
+ except ValueError:
+ LOGGER.error("MIN_MBW_LOW_VMS or MIN_LLC_WAYS_LOW_VMS parameter type is invalid.")
+ sys.exit(1)
if not LOW_MBW_INIT_FLOOR <= low_vms_mbw_init <= LOW_MBW_INIT_CEIL:
LOGGER.error("Invalid environment variables: MIN_MBW_LOW_VMS")
raise Exception
- low_vms_cache_init = int(os.getenv("MIN_LLC_WAYS_LOW_VMS"))
if not LOW_CACHE_INIT_FLOOR <= low_vms_cache_init <= LOW_CACHE_INIT_CEIL:
LOGGER.error("Invalid environment variables: MIN_LLC_WAYS_LOW_VMS")
raise Exception
--
2.17.1
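The pattern is the same in both files: read each environment variable with a default, convert it, and treat a `ValueError` as a fatal configuration error. A condensed sketch of the power-QoS side (defaults taken from the diff above; the dict-returning helper is illustrative, not skylark's API):

```python
import os

def read_threshold_env(env=os.environ):
    """Parse the power-QoS thresholds with their defaults; a malformed
    variable becomes SystemExit(1), mirroring set_hotspot_threshold()."""
    try:
        return {
            "tdp": float(env.get("TDP_THRESHOLD", "0.98")),
            "freq": float(env.get("FREQ_THRESHOLD", "0.98")),
            "abnormal": int(env.get("ABNORMAL_THRESHOLD", "3")),
            "quota": float(env.get("QUOTA_THRESHOLD", "0.9")),
        }
    except ValueError:
        # Threshold parameter type is incorrect: fail fast at startup.
        raise SystemExit(1)
```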

@@ -1,24 +1,40 @@
Name: skylark
Version: 1.0.0
Release: 2
Release: 7
Summary: Skylark is a next-generation QoS-aware scheduler.
License: Mulan PSL v2
URL: https://gitee.com/openeuler/skylark
Source0: %{name}-%{version}.tar.gz
BuildRequires: python3-devel make gcc coreutils
Patch0001: guestinfo-Take-another-VM-stop-reason-to-account.patch
Patch0002: cpu_qos-Add-aditional-setting-for-cpu-QOS.patch
Patch0003: cachembw_qos-Add-a-job-to-sync-VM-pids-to-resctrl.patch
Patch0004: framework-create-pidfile-after-os.fork-in-child-proc.patch
Patch0005: cpu_qos-register-reset_domain_bandwidth-as-exit-func.patch
Patch0006: power_qos-cachembw_qos-Add-type-check-for-environmen.patch
Patch0007: docs-Use-chinese-description-instead-of-English-in-R.patch
Patch0008: Makefile-Use-strip-option-in-gcc.patch
BuildRequires: python3-devel make gcc coreutils systemd-units
Requires: python3 python3-APScheduler python3-libvirt
# For resource partition management
Requires: systemd >= 249-32 libvirt >= 1.0.5
# For service management
Requires(post): systemd-units
Requires(post): systemd-sysv
Requires(preun): systemd-units
Requires(postun): systemd-units
%description
Skylark is a next-generation QoS-aware scheduler which provides coordinated resource scheduling for co-located applications with different QoS requirements.
%ifnarch x86_64
%global debug_package %{nil}
%endif
%prep
%setup -q
%autopatch -p1
%build
@@ -46,6 +62,25 @@ make install DESTDIR=%{buildroot}
%changelog
* Fri Dec 09 2022 Dongxu Sun <sundongxu3@huawei.com> - 1.0.0-7
- docs-Use-chinese-description-instead-of-English-in-R.patch
- Makefile-Use-strip-option-in-gcc.patch
* Sat Sep 03 2022 Dongxu Sun <sundongxu3@huawei.com> - 1.0.0-6
- cpu_qos: Register reset_domain_bandwidth as exit func after adding power_qos job
- power_qos/cachembw_qos: Add type check for environment variables
* Thu Aug 25 2022 Dongxu Sun <sundongxu3@huawei.com> - 1.0.0-5
- framework: create pidfile after os.fork in child process
* Fri Aug 19 2022 Keqian Zhu <zhukeqian1@huawei.com> - 1.0.0-4
- guestinfo: Take another VM stop reason into account
- cpu_qos: Add additional setting for cpu QOS
- cachembw_qos: Add a job to sync VM pids to resctrl
* Wed Aug 10 2022 Keqian Zhu <zhukeqian1@huawei.com> - 1.0.0-3
- spec: Add missing dependencies of build and run
* Fri Aug 05 2022 Keqian Zhu <zhukeqian1@huawei.com> - 1.0.0-2
- Update baseline