Information Security, Web, Networks and Systems

Saturday, February 14, 2015

Implementation of Smart Cloud Scheduler - Part II

11:34 PM Posted by Deepal Jayasekara
In this post, I will discuss the implementation details of the Smart Cloud Scheduler, which was described in a previous post. Let's look at the component-based architecture of the Smart Cloud Scheduler.

Design and Implementation

Web Frontend

Our resource scheduler provides two access methods for cloud users. One method is via the web interface. A user can compose a resource allocation request in the web interface and submit it. Internally, the web interface accesses the API provided by the resource scheduler.

API Endpoint

The API endpoint is a REST API for managing the functionality of the resource scheduler. This REST API provides functions to issue resource allocation requests and to perform admin/user functions. The following functions are currently supported by the API.

Create User - POST /admin/createUser
Login - POST /login
Issue a request - POST /request
Read configuration - GET /admin/configuration
Update configuration - POST /admin/configuration

Creating a user - 

Creating a user is only allowed for administrator users. The Create User request should be sent as a POST request to the URL /admin/createUser, and its body should have the following format:

    {
        "username": "User 1",
        "password": "passw@rd",
        "userinfo": {
            "firstName": "User",
            "lastName": "One"
        },
        "priority": 3,
        "admin": true
    }

The SHA-256 algorithm is used to hash passwords before they are stored in the database, and the timestamp at the point where the account was created is used as the salt for hashing. A successful creation of a new user account will return a response similar to the following:
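The salted hashing described above can be sketched as follows. This is only an illustration in Python (the scheduler itself runs on Node.js). The function names and the way the timestamp is combined with the password (prepended as a decimal string) are assumptions, since that detail is not covered in this post.

```python
import hashlib
import hmac
import time


def hash_password(password: str, created_at_ms: int) -> str:
    """Return the hex SHA-256 digest of the password salted with a timestamp.

    Assumption: the salt is the account creation timestamp in milliseconds,
    rendered as a decimal string and prepended to the password.
    """
    salted = f"{created_at_ms}{password}".encode("utf-8")
    return hashlib.sha256(salted).hexdigest()


def verify_password(password: str, created_at_ms: int, stored_hash: str) -> bool:
    """Recompute the salted hash and compare it with the stored one."""
    candidate = hash_password(password, created_at_ms)
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(candidate, stored_hash)


# Usage: the timestamp must be stored with the user so login can reuse it.
created_at = int(time.time() * 1000)
stored = hash_password("passw@rd", created_at)
print(verify_password("passw@rd", created_at, stored))
```

Because the salt is the creation timestamp, it has to be kept in the user record next to the hash, otherwise the hash cannot be recomputed at login.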

Login - 

The Login function should be performed before sending any Resource Allocation Requests to the API. Users should send a POST request to /login with a JSON request body containing the username and password of their account. A successful login will return a response including the session ID to be used in future requests.

Request body format:


Response body format:

    "message":"Login successful",

The Login function creates a session ID and stores it in the database for future session validation. The session ID is also sent back to the user in the response, as above. The user should include this session ID in future Resource Allocation Requests in order to be authenticated.

Issue a request - 

Once a user is logged in, they can issue Resource Allocation Requests to the Resource Scheduler using this API method. A Resource Allocation Request should be composed as an XML document which describes the resources the user requires. This request should be sent as a POST request to the URL /request with the Content-Type header set to application/xml. We are currently using XML as the Resource Description Language. However, to be consistent with the other API methods, we are extending this API method to also support requests with Content-Type set to application/json. To differentiate between these two types, the Content-Type header is required in the POST request for Resource Allocation. Internally, the Resource Scheduler selects the appropriate parser (XML or JSON) according to the Content-Type header.
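The parser selection by Content-Type can be illustrated with a small sketch. It is written in Python for brevity and is not the scheduler's Node.js code. The element names in the XML example (request, memory, priority) are hypothetical and do not reflect the real Resource Description Language schema.

```python
import json
import xml.etree.ElementTree as ET


def parse_resource_request(content_type: str, body: str) -> dict:
    """Pick a parser from the Content-Type header and return the request as a dict."""
    # Ignore header parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    if media_type == "application/xml":
        root = ET.fromstring(body)
        # Hypothetical flat schema: every child element becomes one key.
        return {child.tag: child.text for child in root}
    raise ValueError(f"Unsupported Content-Type: {content_type!r}")


# Usage with a made-up request body.
xml_body = "<request><memory>2048</memory><priority>3</priority></request>"
print(parse_resource_request("application/xml", xml_body))
```

Rejecting unknown media types explicitly keeps a request with a missing or wrong header from being handed to the wrong parser.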

Following is a sample Resource Allocation Request:


Read Configuration - 

Configuration information includes the settings related to all components in the resource scheduler. To access configuration information, the user should be an administrator. This API method can be accessed by sending a GET request to the URL /admin/configuration. Configuration information is returned as a JSON document.

Write Configuration - 

A new configuration can be written by sending a POST request to the URL /admin/configuration with the configuration information as a JSON document. This method is also protected, and only administrator users are allowed to access it.

Authentication and Authorization service

The Authentication service provides authentication and authorization services for the resource scheduler. API authentication is provided by a session key validation mechanism. Users need to log in to the resource scheduler before issuing any request. Once a user has logged in, the Authentication service generates and returns a persistent session key which should be re-sent in subsequent requests. The generated session key is stored in MongoDB, and when a user issues a request, the Authentication service evaluates the validity of the session key and identifies the user’s privileges.
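A minimal sketch of the session key mechanism described above is given here, in Python rather than the scheduler's Node.js. The in-memory dictionary stands in for the MongoDB collection, and the names create_session and authorize are hypothetical and are not the service's actual functions.

```python
import secrets
from typing import Optional

# Assumption: an in-memory dict replaces the MongoDB session collection.
SESSIONS: dict = {}


def create_session(username: str, is_admin: bool) -> str:
    """Generate a random session key after a successful login and store it."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"username": username, "admin": is_admin}
    return session_id


def authorize(session_id: str, admin_only: bool = False) -> Optional[dict]:
    """Return the user record if the key is valid and privileged enough, else None."""
    user = SESSIONS.get(session_id)
    if user is None:
        return None  # unknown or expired session key
    if admin_only and not user["admin"]:
        return None  # valid session, but not an administrator
    return user


# Usage: a normal user may issue requests but not read the configuration.
sid = create_session("User 1", is_admin=False)
print(authorize(sid) is not None, authorize(sid, admin_only=True) is not None)
```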
Passwords are stored in the database as salted SHA-256 hashes, where the timestamp at the creation of the user account is used as the salt. In the first implementation of the Authentication Service we used MD5 as the hashing method, but we later moved to SHA-256 for better security. Only the login() and createUser() methods of the Authentication service are exposed to the outside via the REST API. The authorizeResourceRequest() function is used internally by the VM Scheduler to authenticate and authorize an incoming resource request. The following diagram is a graphical representation of the authentication and authorization services provided by the module.

Core Scheduler

Host Filter

The Host Filter filters hosts according to the statistics monitored through Zabbix. The filtered hosts are then sent to the VM Scheduler as the candidate hosts for scheduling a particular job.

The complete steps taken by the Host Filter are listed below.
  • The Host Filter receives the resource request and the information on all hosts. This host information is monitored through Zabbix and covers every host in the infrastructure. For every host, an ID is created in Zabbix.
  • Statistics are collected for the specified data items on each host. The monitored items include memory information, CPU load and CPU utilization. For each item, history values are also collected in addition to the current value. In our application we only take the previous value and the current value, since we calculate the EWMA (exponentially weighted moving average) using the last EWMA, which is stored in the database.
Equation for calculating EWMA, where α is the smoothing factor between 0 and 1:
EWMA(new) = α * EWMA(last) + (1 - α) * Current value

  • As mentioned above, the EWMA is calculated for each item, and these values are used to find all the candidate hosts which fulfill the resource requirements of the request.
  • The information on all those candidate hosts is then sent to the VM Scheduler so that the scheduling decision can be made there.
  • Besides the candidate hosts, the Host Filter also finds the hosts that fulfill only the memory requirement of the request, and this list is passed to the VM Scheduler as well. It is used by the Migration and Preemption Schedulers.
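The steps above can be sketched as follows, in Python rather than the scheduler's Node.js. The field names (free_mem_ewma, cpu_util_now and so on), the CPU utilization threshold and the default α of 0.5 are all assumptions made for the example. The post only states that the EWMA of each monitored item is compared against the request.

```python
def update_ewma(last_ewma: float, current: float, alpha: float) -> float:
    """EWMA(new) = alpha * EWMA(last) + (1 - alpha) * current value."""
    return alpha * last_ewma + (1 - alpha) * current


def filter_hosts(hosts, required_memory, max_cpu_util, alpha=0.5):
    """Return (candidate host ids, host ids that satisfy memory only)."""
    candidates, memory_only = [], []
    for host in hosts:
        # Smooth the latest Zabbix readings with the EWMA stored earlier.
        free_mem = update_ewma(host["free_mem_ewma"], host["free_mem_now"], alpha)
        cpu_util = update_ewma(host["cpu_util_ewma"], host["cpu_util_now"], alpha)
        if free_mem >= required_memory:
            memory_only.append(host["id"])
            # Hypothetical CPU rule: smoothed utilization must stay under a cap.
            if cpu_util <= max_cpu_util:
                candidates.append(host["id"])
    return candidates, memory_only


# Usage with made-up readings (memory in GB, CPU utilization in percent).
hosts = [
    {"id": "h1", "free_mem_ewma": 4.0, "free_mem_now": 2.0,
     "cpu_util_ewma": 40.0, "cpu_util_now": 60.0},
    {"id": "h2", "free_mem_ewma": 1.0, "free_mem_now": 1.0,
     "cpu_util_ewma": 10.0, "cpu_util_now": 10.0},
    {"id": "h3", "free_mem_ewma": 8.0, "free_mem_now": 6.0,
     "cpu_util_ewma": 90.0, "cpu_util_now": 100.0},
]
print(filter_hosts(hosts, required_memory=2.0, max_cpu_util=80.0))
```

A larger α gives more weight to history and makes the filter slower to react to short spikes, while a smaller α follows the current reading more closely.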

VM Scheduler

VM Scheduler provides orchestration for all components in the Resource Scheduler, including the Core Scheduler components and the Authentication Service. Following is the flow of how a request is handled by the VM Scheduler:
  1. VM Scheduler receives a Resource Allocation Request via the REST API.
  2. VM Scheduler passes the request to the Authentication service for authentication and authorization.
  3. If the request is authenticated and authorized, the VM Scheduler forwards the authorized request to the Priority Scheduler, which is the coordinator of the Migration Scheduler and the Pre-emption scheduler.
  4. Either the Priority Scheduler or the Migration Scheduler will return a selected host where the request can be allocated. VM Scheduler then performs the allocation on the selected host via the CloudStack API, using the CloudStack Interface.
In addition to resource allocation, resource de-allocation is also performed by the VM Scheduler with the support of the De-allocation manager.

Priority Scheduler

Priority Scheduler acts as the coordinator of the Migration Scheduler and the Pre-emption scheduler. It first forwards the authorized resource request to the Migration Scheduler to find host(s) on which to allocate the incoming request. If the Migration Scheduler does not return any host information, the Priority Scheduler checks whether there are previous allocations with a priority lower than that of the incoming request, and sends the information on those allocations along with the request to the Pre-emption scheduler. If the Pre-emption scheduler selects some hosts, the Priority Scheduler sends that host information back to the VM Scheduler.
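The coordination logic can be sketched as a single function, in Python rather than the scheduler's Node.js. The migrate and preempt callables stand in for the Migration Scheduler and the Pre-emption scheduler, and their signatures are hypothetical.

```python
from typing import Callable, List, Optional


def priority_schedule(
    request: dict,
    allocations: List[dict],
    migrate: Callable[[dict], Optional[str]],
    preempt: Callable[[dict, List[dict]], Optional[str]],
) -> Optional[str]:
    """Try migration first, then preemption; return a host id or None."""
    host = migrate(request)
    if host is not None:
        return host
    # Only allocations with a strictly lower priority may be preempted.
    lower = [a for a in allocations if a["priority"] < request["priority"]]
    if not lower:
        return None  # nothing to preempt, the caller has to handle this
    return preempt(request, lower)


# Usage with stub schedulers: migration fails, preemption picks the
# host of the first lower-priority allocation.
allocations = [{"host": "Host 2", "priority": 1}, {"host": "Host 4", "priority": 3}]
result = priority_schedule(
    {"memory": 2.0, "priority": 3},
    allocations,
    migrate=lambda req: None,
    preempt=lambda req, lower: lower[0]["host"],
)
print(result)
```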

Migration Scheduler

Migration Scheduling is the second step taken by the VM Scheduler when there appear to be no resources to serve the incoming Resource Request. When the Priority Scheduler forwards the incoming request to the Migration Scheduler, the Migration Scheduler checks whether enough resources for the request can be made available on any of the hosts by migrating some of that host's virtual machines to other hosts.
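One possible way to perform this check is sketched below in Python. It is a greedy illustration under several assumptions (memory is the only resource, the smallest VMs are moved first, and each VM goes to the fitting host with the least free memory) and is not the Migration Scheduler's actual algorithm, which this post does not describe in detail.

```python
import copy


def plan_migration(hosts: dict, required: float):
    """Find a host that can fit `required` memory after moving VMs away.

    hosts maps host id -> {"free": float, "vms": {vm id: memory}}.
    Returns (target host, [(vm id, source, destination), ...]) or None.
    """
    for target in hosts:
        state = copy.deepcopy(hosts)  # simulate moves without touching the input
        moves = []
        # Assumption: try to move the smallest VMs first.
        for vm_id, mem in sorted(state[target]["vms"].items(), key=lambda kv: kv[1]):
            if state[target]["free"] >= required:
                break
            fitting = [h for h in state if h != target and state[h]["free"] >= mem]
            if not fitting:
                continue  # this VM cannot be moved anywhere
            dest = min(fitting, key=lambda h: state[h]["free"])
            state[dest]["free"] -= mem
            state[target]["free"] += mem
            moves.append((vm_id, target, dest))
        if state[target]["free"] >= required:
            return target, moves
    return None


# Usage with made-up hosts (memory in GB).
hosts = {
    "A": {"free": 1.0, "vms": {"a1": 1.0, "a2": 2.0}},
    "B": {"free": 1.5, "vms": {"b1": 2.0}},
}
print(plan_migration(hosts, required=2.0))
```

In the example, moving the 1 GB VM a1 from host A to host B raises the free memory on A from 1 GB to 2 GB, which is enough for the request.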

Preemptive Scheduler

When the Migration Scheduler is unable to find a host for an incoming request by reshuffling VMs on the cloud, the request is passed to the Pre-emption scheduler. In the pre-emption step, the Pre-emption scheduler checks the priority of the incoming request, compares it with the current allocations, and decides whether to pre-empt a running allocation in order to gain resources for the incoming request. When a request is preempted, it is saved in the database so that it can be rescheduled later when resources are available. When preemption fails, it means that there are not enough resources to allocate for the incoming request, and the Resource Scheduler returns an error message to the user via the API stating that there are not enough resources to handle the request. When there is a specific set of VMs to be preempted, the Preemption Scheduler calls an additional web service implemented in Java, called ‘JVirshService’. In this external call, the Preemption Scheduler passes the list of VMs to be preempted in JSON format through a RESTful API call, and JVirshService performs the preemption.

JVirsh Service RESTful Java Web Service

JVirshService is a Java RESTful web service which runs separately from the main Node.js application. This web service internally uses the ‘virsh’ command line tool to issue VM snapshot commands to the hypervisor via the libvirt API. When the Preemption Scheduler sends a set of virtual machines to be preempted to JVirshService in JSON format, the service internally calls libvirt and takes VM snapshots at the hypervisor level. Once the snapshot command has executed successfully, JVirshService returns the IP address of the host on which the VMs were preempted. The Preemption Scheduler receives this IP address and informs the VM Scheduler that the preemption is complete.

Allocation Queue Manager

Allocation Queue Manager is the component which stores, in MongoDB, the Resource Requests which cannot be served with the currently available resources. When a request arrives and there are not enough resources to allocate for it, the Resource Scheduler tries the Migration Scheduler and the Preemption Scheduler to find enough resources for the incoming request. In the Preemptive Scheduling phase, if there are no allocations with a priority lower than that of the incoming request, the Preemption Scheduler returns no hosts for the request. At this point, the request is queued by the Allocation Queue Manager and will be allocated when enough resources are available for it in the cloud.

De-allocation Manager

The De-allocation manager performs resource release and re-allocates the preempted and queued requests held by the Allocation Queue Manager.
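A rough sketch of this re-allocation step is given below in Python. The order in which waiting requests are retried (highest priority first) and the best-fit placement are assumptions made for the example. The post does not specify either, and the function name is hypothetical.

```python
def reallocate_after_release(free_memory: dict, pending: list):
    """Retry waiting requests after resources were released.

    free_memory maps host id -> free memory and is updated in place.
    Returns (placed, remaining): a list of (request id, host) pairs and
    the requests that still do not fit.
    """
    placed, remaining = [], []
    # Assumption: requests with higher priority are retried first.
    for req in sorted(pending, key=lambda r: -r["priority"]):
        fitting = [h for h, free in free_memory.items() if free >= req["memory"]]
        if fitting:
            host = min(fitting, key=free_memory.get)  # best fit
            free_memory[host] -= req["memory"]
            placed.append((req["id"], host))
        else:
            remaining.append(req)
    return placed, remaining


# Usage with made-up numbers (memory in GB).
free = {"Host 2": 3.0, "Host 4": 1.5}
queue = [
    {"id": "r1", "memory": 2.0, "priority": 1},
    {"id": "r2", "memory": 2.0, "priority": 3},
    {"id": "r3", "memory": 1.0, "priority": 2},
]
print(reallocate_after_release(free, queue))
```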

Configuration updater

Configuration updater is the module which provides methods to change the configuration of the Resource Scheduler. The configuration of all components of the Resource Scheduler can be changed using this module. Access to this module is protected, and only administrator users are allowed to make configuration updates.

Database Interface

Database interface is the component we use to store information in and retrieve information from MongoDB. Access to MongoDB is provided by a third-party Node.js module called Mongoose. We use the Mongoose module to perform queries and updates on the MongoDB database.

CloudStack Interface

CloudStack interface is implemented using a Node.js module called ‘csclient’. ‘csclient’ is also a third-party module. It provides easy access to the CloudStack API from Node.js and handles complex functionality, including request signing, internally.

Zabbix Interface

Zabbix Interface is implemented as another Node.js module which provides access to the Zabbix Monitoring System via the Zabbix API. It has functions to log in to the monitoring system and issue API requests from Node.js.

MongoDB Database

We are using MongoDB as our database to store configuration information, user information, resource allocation requests, etc. Since we are using Node.js as the development platform for the resource scheduler, and since we need to store queued Resource Allocation Requests easily, MongoDB was an easier fit than MySQL. If we used MySQL, we would need a proper Object Relational Mapping (ORM) layer, or we would have to convert JSON documents into strings before storing them in the database. In that case, information retrieval would also add overhead to create objects from relational data or to parse strings into JSON. Using MongoDB, we can directly store Resource Allocation Requests in JSON format and retrieve them as JavaScript objects easily.

Results and Evaluation

Resource-Aware Virtual Machine Scheduling

One requirement of our project was to improve the default VM scheduling mechanism of CloudStack in order to improve availability for users. By default, CloudStack provides five VM allocation algorithms. The allocation algorithm can be changed by updating the Global Configuration parameter named vm.allocation.algorithm. Following are the five allocation algorithms supported by CloudStack.

Parameter value - Description
random - Pick a random host available across a zone for allocation.
firstfit - Pick the first available host across a zone for allocation.
userdispersing - The host which has the least number of VMs allocated for a given account is selected. This provides load balancing to a certain extent for a given user, but the running VM count is not compared between different user accounts.
userconcentratedpod_random - This algorithm is similar to ‘random’, but it considers hosts within a given pod rather than across the zone.
userconcentratedpod_firstfit - This algorithm is similar to ‘firstfit’, but it only considers hosts within a given pod rather than across the zone.

By default, the random algorithm is selected as the VM allocation algorithm. None of these algorithms provides resource-aware scheduling. Although the userdispersing algorithm considers the number of running VMs which belong to a given user, it does not consider the resource utilization of the VMs or the resource availability on the host. The following snapshot from the Zabbix Monitoring System shows how VM allocation is performed using the random method in CloudStack. The diagram shows the amount of memory available on all hosts.

Default random VM scheduling algorithm of CloudStack
We can identify a VM deployment as a drop of a line in the graph. We can see that the sequence of host selection is Host 3, Host 4, Host 2, Host 4, Host 2, Host 4, Host 2, Host 1, Host 1, etc. Using resource utilization information fetched from the Zabbix server, we have implemented a mechanism to deploy a VM on the host which has the least available memory that is still sufficient for that VM. This algorithm is known as best fit. The reason for using it is to keep as much free memory as possible on the remaining hosts, so that this memory can be allocated to future requests asking for more memory. An alternative would be a load balancing algorithm, which would allocate a VM on the host which has the maximum available memory. But that algorithm could cause frequent shuffling of VMs (Migration Scheduling) when memory intensive VM requests arrive.
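The difference between best fit and the load balancing alternative can be shown with a short sketch. It is written in Python rather than the scheduler's Node.js, considers memory only, and uses made-up free memory values. The function names are hypothetical.

```python
def best_fit(free_memory_by_host: dict, required: float):
    """Pick the host with the least free memory that still fits the VM."""
    fitting = {h: m for h, m in free_memory_by_host.items() if m >= required}
    if not fitting:
        return None  # no host can take the VM without migration or preemption
    return min(fitting, key=fitting.get)


def load_balanced(free_memory_by_host: dict, required: float):
    """Alternative: pick the host with the most free memory that fits the VM."""
    fitting = {h: m for h, m in free_memory_by_host.items() if m >= required}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Usage with made-up free memory values in GB.
free = {"Host 1": 6.0, "Host 2": 2.5, "Host 3": 1.0, "Host 4": 4.0}
print(best_fit(free, 2.0), load_balanced(free, 2.0))
```

With these values, best fit places a 2 GB VM on Host 2 and leaves the 6 GB on Host 1 untouched for a later, larger request, whereas the load balancing variant would start consuming Host 1.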

The following snapshot from the Zabbix Monitoring System shows how resource-aware VM scheduling is performed with our scheduling algorithm on CloudStack:

Resource Aware VM scheduling of Smart Cloud Scheduler
The above diagram shows the amount of available memory on each host in CloudStack. Note that the green line in the Zabbix diagram represents a host which was added to CloudStack later. The graph shows the change in available memory on four hosts when a list of VMs is deployed with the following memory requirements:

VM 1 - 2 GB
VM 2 - 2.5 GB
VM 3 - 1 GB
VM 4 - 2 GB
VM 5 - 2 GB
VM 6 - 2 GB

After these six VM deployments, we added the new host to CloudStack, which is represented in the diagram by the green line. After that, we can see that the next three deployments were performed on the newly added host, since it had the least available memory that still met the requirements. After these three allocations, its available memory drops below 2 GB, so the final VM in our list cannot be deployed on that host. Therefore the final VM, which requires 2 GB of memory, can only be deployed on Host 4, and it gets allocated on Host 4.

VM 7 - 2 GB
VM 8 - 1 GB
VM 9 - 1 GB
VM 10 - 2 GB

Preemptive Scheduling

When low priority requests are allocated on the cloud and there is no space for another resource allocation, we need Preemptive Scheduling to serve incoming high priority requests. Since high priority requests need to be allocated immediately, we need to preempt a suitable number of currently allocated low priority requests to gain space for the incoming high priority request. When there is no space on the cloud for an incoming request, we first move into the Migration Scheduling phase, in which we re-shuffle VMs in the cloud using VM live migration to gain space on a specific host for the incoming request. If Migration Scheduling is not possible, we then move to the Preemptive Scheduling phase, in which we check the priority of the incoming request, compare it with the currently allocated requests, and take the preemption decision based on whether enough low priority requests can be preempted to gain space for the incoming request. The following graph shows memory availability against time on all hosts in the cloud.

Preemptive Scheduling of Smart Cloud Scheduler.
When we consider the section after the red vertical line, the resource scheduler has first created two VMs on host 2. Then it has created 3 VMs on host 4. Following those, it has created one more VM on host 2 and another one on host 4. Now we are in a situation where we do not have sufficient memory capacity for further requests asking for 2 GB of memory or more. Then we get a request asking for 2 GB of memory. At this point the resource scheduler has decided to preempt two VMs on host 2 which are consuming 2 GB of memory, and the same host is then used to allocate the incoming request. We can see that before the preemption there are 3 VMs running on host 2 and 4 VMs running on host 4. The Resource Scheduler has chosen host 2 for preemption because it has fewer running VMs with a priority lower than that of the incoming request. This algorithm can be improved further to consider the resource utilization of each VM. This is currently difficult because Zabbix agent based resource monitoring requires each VM to run the Zabbix Agent.
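The host selection rule described above (prefer the host with the fewest lower-priority VMs) can be sketched as follows, in Python rather than the scheduler's Node.js. The data layout, the function name and the way victims are picked on the chosen host (in listed order until enough memory is free) are assumptions, and the numbers are made up to resemble the scenario in the graph.

```python
def choose_preemption_host(hosts: list, request: dict):
    """Return (host id, [vm ids to preempt]) or None if preemption cannot help."""
    best = None  # (host id, number of lower-priority VMs, victims)
    for host in hosts:
        lower = [vm for vm in host["vms"] if vm["priority"] < request["priority"]]
        victims, available = [], host["free_memory"]
        # Assumption: preempt lower-priority VMs in listed order until it fits.
        for vm in lower:
            if available >= request["memory"]:
                break
            victims.append(vm["id"])
            available += vm["memory"]
        if not victims or available < request["memory"]:
            continue  # preempting on this host would not make room
        if best is None or len(lower) < best[1]:
            best = (host["id"], len(lower), victims)
    return None if best is None else (best[0], best[2])


def low_priority_vms(prefix: str, count: int) -> list:
    """Helper for the example: `count` 1 GB VMs with priority 1."""
    return [{"id": f"{prefix}{i}", "memory": 1.0, "priority": 1} for i in range(count)]


hosts = [
    {"id": "host 2", "free_memory": 0.5, "vms": low_priority_vms("a", 3)},
    {"id": "host 4", "free_memory": 0.5, "vms": low_priority_vms("b", 4)},
]
print(choose_preemption_host(hosts, {"memory": 2.0, "priority": 3}))
```

In this example host 2 is chosen because it runs three lower-priority VMs against four on host 4, and two of them are preempted to make room for the 2 GB request.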

Migration Scheduling

Migration Scheduling was added to the Smart Cloud Scheduler later, so that resources can be released for an incoming request without preemption. Once it was completed, we integrated the Migration Scheduler so that it is called before the Preemption Scheduler. Following is the graph taken from Zabbix with Migration Scheduling and Preemption Scheduling.

Above graph shows a sequence of following VM deployments:

Priority Level 2 (Medium)
        VM1 - 2.5GB
        VM2 - 1.5GB
        VM3 - 2GB
        VM4 - 2GB
        VM5 - 2GB
        VM6 - 2GB
        VM7 - 2GB
Priority Level 3 (High)
        VM8 - 2GB


The current implementation of our Smart Scheduler has several limitations which need to be addressed later. Following are some of those limitations and issues:
  1. VM Group Allocation is not possible yet. In the current implementation, we only support one VM deployment per Resource Allocation Request. Multiple copies of the same VM (e.g. for lab allocations) and groups of different types of VMs cannot be specified in a single Resource Allocation Request.
  2. Advanced Reservation is not available. Our implementation currently supports Immediate Allocations and Best Effort service. Depending on the resource availability on the cloud, the existing resource allocations and the priority of the request, we provide either immediate allocation or best effort service. In Advanced Reservation, a user may specify a future time period during which the resource allocation should be available. The current implementation does not support this feature.
  3. Per VM resource monitoring is not available. We currently monitor the resource utilization of physical hosts. This has a major drawback, because VMs may not be utilizing the full amount of the allocation they have requested (such as memory). In this case, when a VM is idle, the physical host seems to have more resources for new allocations when we monitor the host's resource utilization. This may cause further allocations to be made on the same host. But when the idle VM later utilizes the entire amount of resources it has been allocated, the host gets overloaded and we may need migration or preemption to overcome the overload. To prevent this, we need to monitor the resource utilization of each VM separately and identify when VMs are idle and when they start utilizing their entire allocation.
  4. The VM preemption algorithm needs improvement. It currently selects the physical host with the minimum number of low priority VMs allocated for preemption scheduling. Improvement is necessary because, although there may be hosts with a large number of low priority VMs running, some of those VMs may be idle, and preempting just a few of them could be enough to allocate a high priority incoming request.

