Short and always working: install CUDA on a GCP VM (CentOS 7/CentOS 8)

I’ve spent more than a few minutes trying to run nvidia-smi on CentOS 7 virtual machines running on GCP (Google Cloud Platform), checking a few nice instructions on how to do it properly, like [1,2]. Getting irritated, I told myself: “No way! It’s just standard DKMS”, so I just have to follow a few standard steps, as I have done a number of times in my life.

Install CUDA on cloud VM

If executing nvidia-smi only returns “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver”, this quick tutorial is definitely something for you.

Install nvidia packages

The best is probably to install those over network from nvidia repositories[1]. In my case CentOS 7 it was:
yum-config-manager --add-repo
yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda

Run DKMS to build the driver

In many places you’ll find information that you have to reboot the VM. That can work, since DKMS may rebuild the drivers on reboot, but it’s not necessary, and in the case of a cloud VM you won’t see any error message from the process. Instead of rebooting, just take these two simple steps:
# dkms status
nvidia, 418.87.00: added

to verify that the DKMS module is installed, and run
# dkms autoinstall
Error! echo
Your kernel headers for kernel 3.10.0-957.27.2.el7.x86_64 cannot be found at
/lib/modules/3.10.0-957.27.2.el7.x86_64/build or /lib/modules/3.10.0-957.27.2.el7.x86_64/source.

to build the module. As you probably noticed, it failed in my case, telling me that I don’t have kernel headers installed, which may be confusing for some of you, since a quick check with rpm -qa | grep kernel-headers will show the opposite. In the case of CentOS 7, make sure that you have the kernel-devel package installed for the running kernel. Use the following one-liner:
rpm -qa | grep kernel-devel-$(uname -r) || yum -y install kernel-devel-$(uname -r)
Now you should be ready to run dkms autoinstall without an issue.

Load the module

# nvidia-modprobe
# echo $?

You should be able to successfully call nvidia-smi now, as in the picture below.

nvidia-smi finally working! (screenshot: nvidia-smi output)

Let me know if this worked for you!


[FIXED] Ansible Unexpected Exception.

After a few days of working on something else and having a life without a computer, I came back to the project of an automated setup of a Slurm[1] development environment. As you may know, I have some experience with Ansible, so I decided to use the trio of vagrant, libvirt and ansible.
The project had already been started, so I just issued vagrant up --provider=libvirt and after a few seconds I got the error message below:

ansible-playbook 2.5.1
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/cinek/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/dist-packages/ansible
  executable location = /usr/bin/ansible-playbook
  python version = 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]
Using /etc/ansible/ansible.cfg as config file
Parsed /home/cinek/mySlurm2/.vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory inventory source with ini plugin
ERROR! Unexpected Exception, this is probably a bug: sequence item 0: expected string, NoneType found
the full traceback was:

Traceback (most recent call last):
  File "/usr/bin/ansible-playbook", line 118, in 
    exit_code =
  File "/usr/lib/python2.7/dist-packages/ansible/cli/", line 122, in run
    results =
  File "/usr/lib/python2.7/dist-packages/ansible/executor/", line 81, in run
    pb = Playbook.load(playbook_path, variable_manager=self._variable_manager, loader=self._loader)
  File "/usr/lib/python2.7/dist-packages/ansible/playbook/", line 54, in load
    pb._load_playbook_data(file_name=file_name, variable_manager=variable_manager)
  File "/usr/lib/python2.7/dist-packages/ansible/playbook/", line 106, in _load_playbook_data
    entry_obj = Play.load(entry, variable_manager=variable_manager, loader=self._loader)
  File "/usr/lib/python2.7/dist-packages/ansible/playbook/", line 109, in load
    data['name'] = ','.join(data['hosts'])

I had a number of unstructured thoughts, at the beginning blaming vagrant, the vagrant-libvirt plugin and everyone but ansible or myself. However, just by executing ansible-playbook from the command line I was able to reproduce the issue. OK… let’s open the file suggested by the error message. The load method didn’t look suspicious at all, so I added a dirty print command there.

A few keys pressed quickly: :wq
and I saw the output below:

[u'slurmd_1', u'slurmd_2', u'slurmd_3']
ERROR! Unexpected Exception, this is probably a bug: sequence item 0: expected string, NoneType found

Well… yes, this mysterious error message was coming from the playbook I had been working on a few days earlier and had left with an empty hosts list at the end of the file:
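The failure is easy to reproduce in plain Python, and the guard my pull request proposes can be sketched as below (names are illustrative, not Ansible’s actual code):

```python
def play_name_from_hosts(hosts):
    """Mimic Play.load's data['name'] = ','.join(data['hosts']) with a guard."""
    if hosts is None or any(h is None for h in hosts):
        raise ValueError("Hosts list cannot be empty or contain None "
                         "- check for a play with a blank 'hosts:' entry")
    return ','.join(hosts)

# An empty `hosts:` entry in YAML parses to None, which surfaces as the
# cryptic join error instead of pointing at the playbook:
try:
    ','.join([None])
except TypeError as e:
    print("original error:", e)

print(play_name_from_hosts(['slurmd_1', 'slurmd_2']))
```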

I thought WWJD 🙂 Maybe I can save someone a few minutes by adding a more meaningful message there. I’ve created a pull request that should return a more meaningful message in situations like the one I had [2].



Ansible based automated deployment of iRODS grid.

It’s been a long time since my last post – more than 4 months, a very busy and exciting time for me. The main technical topic I have been working on is the deployment of iRODS[1] for the management of tens of PB of data, analysed by hundreds of engineers using classical HPC resources, cloud providers and traditional desktops in the company network. I hope I’ll have time to share the universal parts of it in a series of posts, starting with the one today about…

Ansible based automated deployment of iRODS grid.

Ansible has been my orchestration choice for more than 5 years now, so it was kind of natural to develop the irods-srv role[2]. It automates the installation of both the iRODS catalogue provider and the consumer (aka resource server). The recommended installation process of iRODS uses a python script that interactively gets additional information, like the iRODS zone name, port and various passwords, from the administrator performing the installation. This can be automated either by sending appropriate answers to standard input or via the --json_configuration_file option given to the script /var/lib/irods/scripts/ when executing it. My goal was not only to make it work, but also to understand the details, so I decided to build my own “unattended_installation.json”.

If you have an existing iRODS grid you’d like to use as a base for automated installation with a .json file, you can execute the izonereport command to get its full configuration.

If you see the error message: ERROR - failed in call to rcZoneReport - -154000 as a result of izonereport, it may mean that some of your server_config.json attributes are missing – in my case it was schema_version [3].

To make sure that you have only the necessary fields you can check the iRODS configuration schemas [5]. Following this, I’ve found a few additional fields in the server_config section that are not required by schema validation but are critical for replication functionality. The whole section is really used to create a separate iRODS configuration file, /etc/irods/server_config.json.

Errors you may see executing irepl

Action: Replication between resources on two servers. You may see different errors in the absence of specific advanced_settings keys.

Result: Replication fails with error:
remote addresses: ERROR: replUtil: repl error for /HPCC/home/rods/testFile, status = -1800000 status = -1800000 KEY_NOT_FOUND
rodsLog contains message:

May  3 15:00:16 pid:10026 remote addresses:, ERROR: iRODS Exception:
file: /tmp/tmppTB_kL/lib/core/include/irods_configuration_parser.hpp
    function: T &irods::configuration_parser::get(const key_path_t &) [T = const int]
    line: 105
    code: -1800000 (KEY_NOT_FOUND)
        key "transfer_buffer_size_for_parallel_transfer_in_megabytes" not found in map.

Resolution: Add transfer_buffer_size_for_parallel_transfer_in_megabytes to advanced_settings dictionary in server_config.json

Result: irepl -v shows “0” threads for the replication. In fact, a single-threaded transfer is done via the icat server.
rodsLog contains message: May 3 16:42:08 pid:1186 remote addresses: ERROR: getNumThreads: acGetNumThreads error, status = -1800000
Resolution: Add “default_number_of_transfer_threads” to “advanced_settings” dictionary in server_config.json.

Result: the irepl command fails with the error message: remote addresses: ERROR: replUtil: repl error for /HPCC/home/rods/testFile, status = -1800000 status = -1800000 KEY_NOT_FOUND
rodsLog contains information:

May  3 16:47:16 pid:1230 remote addresses: ERROR: iRODS Exception:
    file: /tmp/tmppTB_kL/lib/core/include/irods_configuration_parser.hpp
    function: T &irods::configuration_parser::get(const key_path_t &) [T = const int]
    line: 105
    code: -1800000 (KEY_NOT_FOUND)
        key "maximum_size_for_single_buffer_in_megabytes" not found in map.
stack trace:
[stack trace here]
May  3 16:47:16 pid:1230 NOTICE: rsDataObjRepl - Failed to replicate data object.

Resolution: Add maximum_size_for_single_buffer_in_megabytes to the advanced_settings dictionary in server_config.json.
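Putting the three resolutions together, a sketch that patches any missing keys into advanced_settings (the numeric values are illustrative examples only, not recommendations; pick values appropriate for your environment):

```python
import json

# Keys irepl needs, per the errors above; the values here are only examples.
REQUIRED_ADVANCED_SETTINGS = {
    "transfer_buffer_size_for_parallel_transfer_in_megabytes": 4,
    "default_number_of_transfer_threads": 4,
    "maximum_size_for_single_buffer_in_megabytes": 32,
}

def patch_advanced_settings(path="/etc/irods/server_config.json"):
    """Add any missing advanced_settings keys without touching existing ones."""
    with open(path) as f:
        config = json.load(f)
    advanced = config.setdefault("advanced_settings", {})
    for key, value in REQUIRED_ADVANCED_SETTINGS.items():
        advanced.setdefault(key, value)
    with open(path, "w") as f:
        json.dump(config, f, indent=4, sort_keys=True)
    return config
```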

I gathered the discovered issues in a pull request to the iRODS configuration schema repository[5] to get feedback from the team. I think that all of them should be marked as required. One of the concerns (default_number_of_transfer_threads) affects multithreaded transfers only and falls back to a single thread, but it still ends up with an error message in rodsLog, and omitting the key in the configuration file can’t be a recommended way to achieve that behavior.

Another section of special interest in the .json file we pass to the script is hosts_config, which is actually the content of the /etc/irods/hosts_config.json file after the installation. This file is a kind of iRODS’ own /etc/hosts. If you can fully rely on DNS you don’t have to use it at all. However, in some complicated scenarios, like DNS round-robin servers providing access to a shared file system, it may help to make sure that each individual server won’t redirect traffic to others in its round-robin group. I decided to build it based on an ansible dictionary shared between all hosts in the grid; storing this dictionary in ansible group_vars may be a convenient way to distribute it between all servers. Compared to /etc/hosts, every server in hosts_config.json is specified as either local or remote. I assumed that ansible_default_ipv4.address will be on the list of IPs configured for each host; based on its presence on the list, the template selects the local type. For the details please check the contents of hosts_config.json.j2[6].
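The local/remote selection performed by hosts_config.json.j2 can be sketched in plain Python. The input dictionary layout is my own convention from the role, and the output schema follows iRODS 4.x hosts_config.json as far as I recall it, so verify it against your installation:

```python
def build_hosts_config(grid_hosts, my_ip):
    """Build the host_entries list for hosts_config.json.

    grid_hosts maps a canonical hostname to the list of names/IPs it is
    reachable under; an entry is 'local' when my_ip is among its addresses,
    'remote' otherwise.
    """
    entries = []
    for hostname, addresses in sorted(grid_hosts.items()):
        address_type = "local" if my_ip in addresses else "remote"
        entries.append({
            "address_type": address_type,
            "addresses": [{"address": a} for a in addresses],
        })
    return {"host_entries": entries}
```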

Obviously, the approach with a dedicated python script executed as part of the installation process is not straightforward to integrate with orchestration frameworks like ansible. An important aspect of orchestration is that the same code used for deployment is also used to maintain service configuration long term. Unfortunately, the iRODS setup script will simply fail when executed on an already configured catalogue server, with an error message stating: IrodsError: Database specified already in use by iRODS. We can overcome this in ansible using the failed_when and ignore_errors constructs; nevertheless, I think that executing this script on an iRODS installation that is already running is not a good idea at all. It’s not tested in such a scenario and it has the additional side effects of resource creation and execution of iput/iget commands.

My approach is to execute the script only during the first installation, with the configuration JSON file created from a template where the sections responsible for specific files are included from separate templates. Those templates are then used to generate files like “server_config.json” or “hosts_config.json” when the servers are in operation.

The role currently works, but for sure there are a lot of things that can be improved. The next step in development will be finalization of molecule-based CI, followed by code restructuring.

Since my main focus so far has been on deployment of a new environment, it would be very difficult to work on the role without the possibility to quickly recreate a development environment from scratch. To automate the process I decided to use vagrant[7]. In the first release of the role on github you’ll find a Vagrantfile as well. Thanks to it, you can start playing with the role and/or iRODS fairly easily – simply clone the repository and start a three-node (1 catalogue + 2 resource) iRODS grid by just executing vagrant up[8], as in the snippet below:

cinek@cinek-schmd:~/git-repos/ansible-role-irods-srv$ vagrant up --provider=libvirt
Bringing machine 'icat' up with 'libvirt' provider...
Bringing machine 'ires-a1' up with 'libvirt' provider...
Bringing machine 'ires-a2' up with 'libvirt' provider...
PLAY RECAP *********************************************************************
ires-a2                    : ok=16   changed=13   unreachable=0    failed=0   
cinek@cinek-schmd:~/git-repos/ansible-role-irods-srv$ vagrant ssh ires-a2
Last login: Fri May  3 17:58:39 2019 from
[vagrant@ires-a2 ~]$
[vagrant@ires-a2 ~]$
[vagrant@ires-a2 ~]$ ping ires-a1.local
PING ires-a1.local ( 56(84) bytes of data.
64 bytes from ires-a1.local ( icmp_seq=3 ttl=64 time=0.916 ms
[vagrant@ires-a2 ~]$ sudo su
[root@ires-a2 vagrant]# su - irods
Last login: Fri May  3 18:15:37 UTC 2019 on pts/0
-bash-4.2$ ils
-bash-4.2$ ilsresc
-bash-4.2$ mkdir /var/lib/irods/testResource
-bash-4.2$ iadmin mkresc testA2 unixfilesystem ires-a2.local:/var/lib/irods/testResource/
Creating resource:
Name:        "testA2"
Type:        "unixfilesystem"
Host:        "ires-a2.local"
Path:        "/var/lib/irods/testResource/"
Context:    ""
-bash-4.2$ dd if=/dev/zero of=/tmp/testFile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.147028 s, 713 MB/s
-bash-4.2$ iput  /tmp/testFile
-bash-4.2$ ils -l
  rods              0 demoResc    104857600 2019-05-03.18:18 & testFile
-bash-4.2$ ls /var/lib/irods/testResource/
-bash-4.2$ irepl -R testA2 testFile
-bash-4.2$ ils -l
  rods              0 demoResc    104857600 2019-05-03.18:18 & testFile
  rods              1 testA2    104857600 2019-05-03.18:18 & testFile
-bash-4.2$ ls /var/lib/irods/testResource/
-bash-4.2$ ls /var/lib/irods/testResource/home/rods/testFile
-bash-4.2$ ls -l /var/lib/irods/testResource/home/rods/testFile
-rw-------. 1 irods irods 104857600 May  3 18:18 /var/lib/irods/testResource/home/rods/testFile

As you can see, I just started 3 CentOS 7 VMs and applied the appropriate ansible playbook, which ended up with an iRODS grid running on my laptop (just by executing vagrant up). Then I logged in to one of the resource servers, ires-a2, defined a resource on it and used iput to upload a file to the resource on the catalogue server. Finally, the file was replicated by irepl back to ires-a2.

Comments appreciated, especially because I’m thinking about more posts on iRODS in the near future!


How to develop ansible-review standards?

One of the very important aspects of the infrastructure as code approach is automated testing of code standards. In my case the code is ansible playbooks, tasks and variables. Looking for an available solution I found a great tool announced by Will Thames on his blog [1] – ansible-review [2]. Unfortunately, the example standard shown in [1] is very simple, and the other standards rely on ansible-lint, which doesn’t show the full capabilities of ansible-review. In this post I’d like to share my experience with standards development, based on a few examples. I’m not going to discuss basic usage of ansible-review – you can check that on one of the cited sites.

How to develop ansible-review standards?

In all the examples below I assume you have imported the packages from the example.

Example1: Don’t allow spaces in task name.

To make it easier to copy a task name (double-click on text to mark the word) and use it in --start-at-task, it’s convenient to forbid spaces in task names. To achieve this with ansible-review you may use code similar to the one below:
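The standard can be sketched as below; note that the Error, Result and Standard classes are minimal stand-ins for the ones the ansiblereview package provides, so the post stays self-contained:

```python
class Error(object):
    def __init__(self, lineno, message):
        self.lineno, self.message = lineno, message

class Result(object):
    def __init__(self, candidate, errors=None):
        self.candidate, self.errors = candidate, errors or []

class Standard(object):
    def __init__(self, standard_dict):
        self.name = standard_dict["name"]
        self.check = standard_dict["check"]
        self.types = standard_dict["types"]
        self.version = standard_dict["version"]

def check_task_name(candidate, settings):
    errors = []
    with open(candidate.path) as f:
        for lineno, line in enumerate(f, start=1):
            words = line.split()
            # A task name line looks like "- name: some task name";
            # more than 3 words means the name itself contains a space.
            if len(words) > 1 and words[1] == "name:" and len(words) > 3:
                errors.append(Error(lineno, "task name contains spaces"))
    return Result(candidate, errors)

task_name_should_not_have_spaces = Standard(dict(
    name="Task names should not contain spaces",
    check=check_task_name,
    types=["tasks", "handlers", "playbook"],
    version="0.1",
))
```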

Reading of the code should start where we define the standard called task_name_should_not_have_spaces. As required by the ansible-review framework, a standard is an association of:

  1. name – the string describing the standard,
  2. check – python function that will verify if the standard is matched,
  3. types of ansible elements where the standard applies (tasks, defaults, playbook, etc.),
  4. and version in which the standard is enforced.

In the example above, evaluation is performed by the function called check_task_name defined in the code snippet. Check functions receive 2 arguments:

  1. candidate – a file being reviewed
  2. and setting – a dictionary with lintdir, config file and a few other settings. Personally I don’t have any particular use case for them in check function.

What happens in our 1st check function is straightforward. We open the file and read it line by line. If the second word is name: (the first is just the YAML list marker), then we check if the number of words in the line is greater than 3, since this indicates a space in the name. If this is the case, we add an Error to the errors list in the Result object returned by the check function.

Example 2: Check if variables defined in vaulted defaults/main.yml file are prefixed with role name.

If you have some experience in maintaining an ansible repository with multiple roles, you probably know that having variables defined in different places sometimes makes it difficult to predict the result. Because of that, it may be a good practice to prefix variables defined in defaults with the name of the role where they are defined. Even if you overwrite such a variable in group settings, it will always remind you that the initial idea was to use it within a role. It’s quite common to store some secrets like passwords and API tokens there, so a good practice is to encrypt those files. In the example below, in addition to the standard ansible-review imports, we use the ansible-vault python package to handle data decryption.
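A self-contained sketch of the naming check; the decryption step, shown as a comment, relies on the ansible-vault package and the password file path mentioned below, and the function layout is my approximation rather than the exact gist:

```python
import os

def parse_role_name(path):
    """Extract the role name from e.g. roles/myrole/defaults/main.yml."""
    parts = os.path.normpath(path).split(os.sep)
    if parts[-2:] != ["defaults", "main.yml"]:
        return None
    return parts[-3]

def prefix_errors(role_name, defaults):
    """Return (key, message) pairs for keys missing the role-name prefix."""
    prefix = role_name + "_"
    return [(key, "variable %s should start with %s" % (key, prefix))
            for key in defaults if not key.startswith(prefix)]

def check_defaults_prefix(path, defaults):
    """defaults: the dict obtained by decrypting and parsing the file."""
    role_name = parse_role_name(path)
    if role_name is None:
        return []
    return prefix_errors(role_name, defaults)

# With the ansible-vault package the decryption step looks roughly like:
#   from ansible_vault import Vault
#   vault = Vault(open('/etc/ansible/vault-password').read().strip())
#   defaults = vault.load(open(path).read())
```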

As you can see in the listing above, we define a standard called all_defaults_start_with_rolename that will be applied only to defaults. In the check function we parse the role name out of the file path (actually doing a double verification that the file is a defaults/main.yml). It is assumed that the vault password is stored in /etc/ansible/vault-password; we then use the Vault.load method to decrypt and read the file. This method internally executes a yaml parser that stores the input in the defaults dictionary. Our next step is a simple iteration over the keys in the dictionary to check their names. If one is missing the appropriate prefix, another entry is added to the result.errors list.

Example 3: Make sure that all tasks in role have a standard “role” tag.

One of the very important differences between ansible and puppet is that with ansible it’s more natural to push configuration changes to hosts than to automatically execute everything on the hosts. When you’re deploying changes, you normally use playbooks containing much more than the latest modification. I find it convenient to have standard tags for all tasks in a role. This is something you can achieve with the standard defined below:
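A sketch of the tag check, assuming the tasks have already been parsed into a list of dicts (as ansible-review’s parse_yaml_linenumbers does); the role_&lt;name&gt; tag convention follows the text:

```python
def check_role_tag(role_name, tasks):
    """Report tasks not tagged with role_<role_name>.

    tasks: list of dicts as produced by YAML parsing, e.g.
    [{'name': 'install cuda', 'yum': 'name=cuda', 'tags': ['role_cuda']}]
    """
    required = "role_" + role_name
    errors = []
    for task in tasks:
        tags = task.get("tags") or []
        if isinstance(tags, str):  # "tags: role_cuda" parses to a string
            tags = [tags]
        if required not in tags:
            errors.append("task %r is missing tag %s"
                          % (task.get("name", "<unnamed>"), required))
    return errors
```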

It’s quite similar to the previous example. The input data is not encrypted, so instead of Vault.load, loading of the YAML is done by the parse_yaml_linenumbers function. Besides that, the logic is easy to understand – the loop iterates over all tasks and checks if they are tagged with role_ROLE NAME.

In this case it’s quite important to emphasise that this may not be sufficient for tasks included dynamically from main.yml; those will not inherit the tag, so the logic should be a little more sophisticated. If you have this issue addressed – let me know 🙂

Example 4: Fail on services restarted in tasks.

Personally, I think that services should never be restarted in tasks; this should always be done by a handler. Tasks should only use service states like started or stopped. An implementation of this may look like:
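A sketch of the check, again over already-parsed task dicts (an approximation, not the exact gist):

```python
def check_service_restarted(tasks):
    """Flag tasks that use the service module with state: restarted."""
    errors = []
    for task in tasks:
        service = task.get("service")
        # Only the full YAML form is handled here; "service: name=x state=restarted"
        # in-line syntax or local_action would need extra parsing.
        if isinstance(service, dict) and service.get("state") == "restarted":
            errors.append("task %r restarts a service; use a handler instead"
                          % task.get("name", "<unnamed>"))
    return errors
```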

The most crucial part is the if statement that checks whether the task has a service key indicating the module used. In that case we check if state was specified and report an error when it’s restarted. As in example 3, there are some cases where this standard will not work, like a task without full YAML format (in-line module arguments) or the use of local_action.

I hope those examples will help you developing your own ansible-review standards!


Notes from the XDMoD patch mitigating the issue with overestimated wall time for suspended jobs.

XDMoD[1] is a fantastic tool that provides various summaries of HPC cluster accounting. It supports all popular HPC resource managers, including Slurm[2], which has been the queuing system of my choice for more than 5 years. I have a very good opinion of XDMoD code quality, so the day I saw utilization of the cluster being over 100% for a few days, my eyes were on stalks. (You can see the plot of the stacked per-project utilization of the cluster below.) Checking on the queuing system accounting, I simply used the sreport command, which showed that the nodes for a specific day were occupied all the time – but this should end up at 100%, shouldn’t it?

Notes from the XDMoD patch mitigating the issue with overestimated wall time for suspended jobs.

Digging deeper into the details of the jobs executed during this period, I noticed that several had a long “Suspended” time. My guess was that maybe this was the reason for XDMoD’s overestimated cluster utilization. Asking this question on the support mailing list, I received confirmation that similar issues were observed on clusters with job suspension enabled (thanks to Trey Dockendorf for the prompt replies). Finally, checking the code, I understood one very fundamental difficulty – the queuing system doesn’t provide the time slots when the job was actually running, so XDMoD, creating a plot as a function of time, doesn’t have all the information required to work 100% correctly.

I feel like the best approach would be to add another parameter to XDMoD, like a fraction of time, or to change the number of CPUs to a floating-point number and, in the case of suspended or gang-scheduled jobs, recalculate it as numberOfCoresUsed * wallTime / (endTime – startTime). I didn’t feel like this was a change I could implement within a few hours and get merged upstream, but in my case the issue was coming only from a few short jobs (a few minutes of wall time) that were suspended for more than a day.

A quick fix for me was to add another date validation to the “Shredder” code – simply, if endTime – startTime is larger than the wallTime provided by the queuing system, I falsify endTime with the value of startTime + wallTime. Such an approach won’t fix all potential issues, but it mitigates them a lot. The issue will still be visible in the case of gang-scheduled jobs: XDMoD will show all of them running in the same time period (utilization over 100%) and ending earlier than really happened (appropriately lower utilization of the cluster in that period). However, the clear benefit is that the total utilization will be correct. At the time of writing this post I’m trying to get this merged into the XDMoD github project. We’ll see if it will be accepted [3], but if you’re looking for this partial fix, just update your code with the patch from the gist below.
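The arithmetic of the fix, stripped of the Shredder plumbing (the real change lives in XDMoD’s PHP code; this is just a sketch of the date validation):

```python
def clamp_end_time(start_time, end_time, wall_time):
    """If suspension inflated end - start beyond wallTime, shrink endTime.

    All arguments are epoch seconds; wall_time is the time the job actually
    ran, as reported by the queuing system.
    """
    if end_time - start_time > wall_time:
        return start_time + wall_time
    return end_time

# A 5-minute job suspended for roughly a day no longer spans the whole day:
print(clamp_end_time(0, 86700, 300))  # -> 300
```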

Since this applies changes at the “Shredder” stage, it won’t fix data you have already ingested. To achieve that, you’ll have to remove this part of the data and ingest it one more time. This requires manual job data removal from the XDMoD backend databases, which can be done with the help of the script below.


[SOLVED] Singularity 2.6 – fails to resize writable container

Executing singularity image expand -s 1G ./mycontainer.simg failed for me with the following error message:

e2fsck 1.41.12 (17-May-2010)
e2fsck: Superblock invalid, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open ./centos.simg

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 

which suggests that e2fsck is simply called against a file/device that doesn’t have an EXT file system inside. However, since I knew that this is a writable image[1], the EXT file system must be there. My guess was that the file system doesn’t start at the very beginning of the file, which… happened to be the issue.

Debugging and fixing the singularity image expand failure.

I was able to write inside the container, so it was easy to find out how the file system is mounted. Simply run a sleep process inside singularity and check the mounts of the process in the /proc/PID directory. In my case the following commands were helpful:

[root@srv ]# singularity exec ./centos.sigm sleep 10m &
[3] 6342
[root@srv ]# ps aux | grep sleep
root      6342  0.0  0.0 100916   608 pts/18   S+   10:22   0:00 sleep 10m
root     15721  0.0  0.0 103252   864 pts/24   S+   06:56   0:00 grep sleep

I’ve added the ps output just to make it obvious that the PID returned by singularity is actually sleep running within the singularity container.

For the purpose of the post I’ll show only the | grep loop part of the process mounts, since the full list is quite long. As you’ll see in the listing below, the device is mounted as an ext3 file system in read-only mode, which is the case because I didn’t add the --writable option to my singularity exec.

[root@srv ]#cat /proc/6342/mounts | grep loop
udev /dev/loop1 devtmpfs rw,relatime,size=132263744k,nr_inodes=33065936,mode=755 0 0
/dev/loop2 /local/singularity/mnt/container ext3 ro,nosuid,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
/dev/loop2 /local/singularity/mnt/final ext3 ro,nosuid,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
udev /local/singularity/mnt/final/dev/loop1 devtmpfs rw,relatime,size=132263744k,nr_inodes=33065936,mode=755 0 0

Let’s check how the loop device was created, to verify if the issue is really an offset of the ext3-formatted space inside our file:

[root@srv ]# losetup -a | grep loop2
/dev/loop2: [001d]:6712556 (), offset 31

Bingo! The offset is 31. Simply creating a loop device manually with this offset and running tools like dd, e2fsck and resize2fs allowed me to resize the container file system. Checking the code, I found that in version 2.6 the whole responsibility lies with a shell script called image.expand.exec. I’m not sure if the offset is always 31, but if it is in your case you can use the patch below (it’s done against the 2.6 tag).

Thanks to @jmstover [2] I know that this offset is something expected in every .simg file, since it’s simply a shebang:

[root@usinkok-log01 singularity]# head -c 32 ./centos.sigm
#!/usr/bin/env run-singularity

Its goal is to allow simplified execution of applications in the container – doing, for instance, just ./myContainer.simg. Nevertheless, the offset is fixed, so I submitted my patch as a pull request[3].
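The 31-byte offset reported by losetup matches the length of that shebang line exactly:

```python
# The shebang at the start of a writable image: 30 characters plus the
# trailing newline, which is why the file system starts at offset 31.
shebang = "#!/usr/bin/env run-singularity\n"
print(len(shebang))  # -> 31
```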


What one should know developing an API 2 API serverless translating proxy.

Grafana snow integration scheme

As feedback from the last post about the new version of snow-grafana-proxy[1], I got a question about similar functionality implemented as an AWS Lambda – a cloud service dedicated to serverless infrastructure. In this model you pay for memory*time computing hours and you don’t have to think about the platform. You may think that you have to keep it running because it has to listen for incoming connections, but here the AWS API Gateway service comes into play, allowing you to configure an HTTP listening endpoint that executes an AWS Lambda function per request. The whole concept is depicted in the schema.

How to configure AWS API gateway with Lambda functions working as a backend?

This question has been answered a number of times, so instead of repeating it I’ll just redirect you to the AWS docs and a blog post I read [2,3]. If you are interested in configuration done from awscli, you can find the appropriate commands in the README file in the subproject directory [4].

I’d like to focus on a few hints for those who would like to create a similar service, so the topic is:

What one should know developing an API 2 API serverless translating proxy.

1) Prepare your own local test cases.
Although it’s possible to test everything in the AWS Lambda free tier, I don’t think it’s an efficient test procedure. For some reason I’m a vim user and I really don’t use any sophisticated IDE – maybe this is why the option to edit a script in the AWS web interface was not comfortable enough for me. The easiest way was to add a few lines at the end:

With those few lines I was able to test the proxy locally without the need to repeatedly update the lambda function’s code. Updating requires zipping the files and uploading the new version – both are quick operations, but still, if you’re doing it tens or hundreds of times, even an additional 5 seconds matters.
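Those few lines boil down to invoking the handler directly when the module runs as a script; the handler body below is a stand-in, not the real snow-grafana-proxy code:

```python
import json

def lambda_handler(event, context):
    """Stand-in for the real handler: echo the grafana query target back."""
    body = json.loads(event.get("body") or "{}")
    return {"statusCode": 200,
            "body": json.dumps({"received": body.get("target", "")})}

if __name__ == "__main__":
    # Exercise the handler locally, without API Gateway in front of it.
    fake_event = {"body": json.dumps({"target": "incidents"})}
    print(lambda_handler(fake_event, None))
```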

2) Before you start check if everything can be serverless.
In my case it’s really true – one can use grafana as a service, and ServiceNow is also hosted for users without the need for your own servers, so snow-grafana-proxy based on SaaS makes perfect sense. If you need to maintain the platform for one of those sides anyway, a lightweight translating proxy can and should be deployed there.

3) Check if backend API replies promptly.
I was testing this on an instance from the ServiceNow developer program, but from our company’s reality I also know that it may take some time to get an answer from ServiceNow. Of course, you can adjust the AWS Lambda function timeout; however, you pay for execution wall time. It makes no difference if your function was waiting or doing real work, so from a cost perspective my tests on the developer instance, which showed the need to increase the timeout to ~30s, were not very promising.

4) Serverless means no daemon.
This may simplify development of the “translator”, but it will also limit your possibilities. In my case I had to remove a lot of in-memory caching, since it didn’t make sense: a subsequent call will start the process from scratch – it won’t “remember” the result we got from the backend API 5s ago. Of course, you can use another cloud service to store this state. Interested in this? Create a pull request or open an issue with suggestions 🙂 You can also use the caching capabilities of AWS API Gateway, but… in the case of API 2 API it’s highly probable that configuring this won’t be a piece of cake – for instance, parts of the request may not be relevant, which means that you may get a lower hit-rate than possible.

5) Hard-coding may not be bad idea.
Simply hardcoding configuration into the scripts will reduce the number of dependencies. It’s important, since if you need additional modules (not available in AWS Lambda by default) you have to create a deployment package [5]. It’s not difficult, especially if you’re developing your functions in virtualenv, but it will have an impact on price, since it will increase memory requirements (important if you’re going above the minimum of 128MB and you’re not waiting 20s for a backend reply – as I was :).

Finally, special thanks goes to @JanGarlaj who opened the feature request and gave me a few hints on how the implementation should look like.

You may be interested in other posts from this category.


Integration for ServiceNow table API and grafana.

Some time ago I wrote a blog post about my approach to ServiceNow and grafana integration; you can find it under this link[1]. The key concept used there is presented in the diagram below (Grafana snow integration scheme). Besides the technical aspects of the integration, the operational results were very good and reduced the time incidents were spending in my “queue” – simply by giving an overview of what is assigned to whom and what the status of the tickets is. However, due to the lack of flexibility in the 1st version of the snow-grafana-proxy implementation, it was difficult to reuse it in other places. The attributes returned to grafana, the lookup methods and the table were hard-coded. I decided to rewrite the service, and here we are.

New version of snow-grafana-proxy available!

You can find the new release on the project’s GitHub page [2]. I’d say it’s in beta: there are no known issues, but if you encounter any difficulty, just let me know by opening an issue on the project’s GitHub page. (I’ve been testing it against a Kingston developer instance.)

New configuration file

The configuration file format changed from INI to YAML. This change allows a much more structured configuration: it’s now possible to configure multiple queries, against any ServiceNow table, with arbitrary filters. Each value has a configurable “interpreter”; at the time of publication the following are available:

  • none – Simply return value of the attribute specified as “name” argument.
  • map – Use “map” dictionary defined for attribute and send corresponding values assuming that value from service-now is a key for “map” dictionary.
  • object_attr_by_link – Assumes that for this attribute the ServiceNow API returns a value/link pair. In this case an additional HTTP request may be needed to get the information available under the link. This interpreter requires additional parameters specified in the interpreterParameters dictionary, for instance: interpreterParams: { linkAttribute: "name", default: "FailedToGetName"} will send to Grafana the value of the name attribute available under the link from the previous GET request. In case of failure the interpreter returns the value “FailedToGetName”. The default value is important, since sometimes the value really is undefined – like the description of the assignment group for an unassigned incident. These values are cached until snow-grafana-proxy restarts, which greatly reduces the number of REST calls to ServiceNow.
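The dispatch over these three interpreters can be sketched roughly like this. This is an illustration of the idea, not the actual snow-grafana-proxy code; `fetch_link` stands in for the extra HTTP GET that object_attr_by_link needs:

```python
# Illustrative dispatch over the three interpreter types described above.
# fetch_link is a callable standing in for the extra HTTP request that
# follows the "link" part of a ServiceNow value/link pair.
def interpret(record, attr_config, fetch_link):
    name = attr_config["name"]
    interpreter = attr_config.get("interpreter", "none")

    if interpreter == "none":
        # Simply return the raw attribute value.
        return record[name]

    if interpreter == "map":
        # Translate the ServiceNow value through a user-defined dictionary.
        return attr_config["map"][record[name]]

    if interpreter == "object_attr_by_link":
        # ServiceNow returned a {"value": ..., "link": ...} pair; follow
        # the link and pick one attribute, falling back to the default.
        params = attr_config["interpreterParams"]
        try:
            linked = fetch_link(record[name]["link"])
            return linked[params["linkAttribute"]]
        except Exception:
            return params["default"]

    raise ValueError("unknown interpreter: %s" % interpreter)
```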

An example configuration file is available in the repository.
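As an illustration only (not the actual file from the repository – the attribute values below are made up), a query definition using the options described above might look roughly like this:

```yaml
# Hypothetical query definition; consult the example in the repository
# for the authoritative format.
queries:
  - name: incidents
    table: incident
    cacheTime: 60
    snowFilter: "active=true"
    attributes:
      - name: number
        interpreter: none
      - name: state
        interpreter: map
        map: { "1": "New", "2": "In Progress" }
      - name: assignment_group
        interpreter: object_attr_by_link
        interpreterParams: { linkAttribute: "name", default: "FailedToGetName" }
```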

As you can see, there are a few additional parameters I haven’t explained yet:

  • cacheTime – caches query replies for the specified number of seconds, so if we get the information once and someone runs the same query a few seconds later (another user with the Grafana dashboard open, or even the same one with a very short auto-refresh time), the proxy replies immediately with the same information.
  • snowFilter – the value passed to ServiceNow API as a request parameters. You can check available parameters in snow documentation[3].
  • table – quite self-explanatory: the name of the table you’d like to get information from, like incident, sc_task or change.

The results of the example configuration may look like the images below.

New command line options

In addition to the changed configuration file, this version supports two command line options, explained in the command’s help:

 usage: [-h] [-c CONFIGFILE] [-d]

Simple JSON data source for grafana retrieving data from service-now

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGFILE         Configuration file to use
  -d, --debug           Start in foreground and print all logs to stdout

Yes! By default snow-grafana-proxy will daemonize itself; if you’d like to run it in the foreground for debugging, just use the -d option as explained above.

Feedback appreciated!


[VLOG] How to configure sssd idmap parameters to get minimal collision probability?

This post covers the quite complicated topic of sssd configuration parameters for SID to UID/GID mapping. I promised to share this with you a few weeks ago. The key here is to understand the principles of the mapping algorithm implemented in sssd, which I described in a previous post and vlog; the consequences, however, may not be so obvious.

To help myself configure parameters like ldap_idmap_range_max, ldap_idmap_range_min or ldap_idmap_range_size, I developed a mapping simulator – you can find the code on GitHub. At the time of this post’s publication all parameters are hardcoded, but it’s very easy to adjust the script to your needs.
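The core of such a simulation fits in a few lines of Python. Note the hedges: sssd actually hashes the domain SID with murmurhash3, while the sketch below uses sha1 as a stand-in – good enough to estimate collision rates across slices, but not to reproduce sssd’s exact slice assignments. The range values shown in the test match the sssd defaults (slices of 200000 IDs).

```python
import hashlib
import random

# Sketch of an idmap collision simulator. sssd derives a slice index by
# hashing the domain SID; sha1 here is a stand-in for sssd's murmurhash3,
# adequate for collision statistics but not for exact slice numbers.
def slice_for_sid(sid, range_min, range_max, range_size):
    n_slices = (range_max - range_min) // range_size
    digest = hashlib.sha1(sid.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_slices

def collision_probability(n_domains, range_min, range_max, range_size, trials=200):
    """Estimate how often at least two random domains land on the same slice."""
    collisions = 0
    rng = random.Random(42)  # fixed seed for reproducibility
    for _ in range(trials):
        sids = {"S-1-5-21-%d-%d-%d" % (rng.getrandbits(31),
                                       rng.getrandbits(31),
                                       rng.getrandbits(31))
                for _ in range(n_domains)}
        slices = {slice_for_sid(s, range_min, range_max, range_size) for s in sids}
        if len(slices) < len(sids):
            collisions += 1
    return collisions / trials
```

As expected from the birthday problem, with the default 10000 slices the collision probability grows quickly with the number of domains, which is exactly why tuning the range parameters matters.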

Interested in some details? Watch the video.

Ansible role used as plugin for simple REST API services.

For quite a long time in our infrastructure we’ve been managing services in icinga2 based on a set of lists defined in Ansible variables. The advantage of this approach was a nice icinga2 configuration with one file per host gathering all services. However, the direct consequence was the need to manage all monitoring endpoints within one Ansible task – simply a template with the definition of the host and all “apply” statements. At some point this approach started to be annoying, because all the icinga definitions lived in one role – a different one than the software and monitoring plugin deployment. The second issue was variable files reaching a thousand lines (reading such a long YAML is not a pleasure). One day I thought: it’s time to switch to creating services over the API – there must be an Ansible module that will help us.

How to dynamically create icinga2 services from ansible?

Surprisingly, when you check the Ansible website [1], you’ll notice that at the time of publication the main distribution contains only two icinga2 modules: icinga2_feature and icinga2_host. I had some experience with the icinga2 API, so my first thought was to develop an icinga2_service module or join a project working on one. However, after a few minutes I realized that it really comes down to a few REST calls, which can be implemented as an Ansible role applied to other roles as a dependency.

The icinga2 API has basically four calls for service objects, each realized by a different HTTP method. The calls are really self-explanatory:

  • GET, to check if a service exists and read its attributes
  • PUT, to add a new service
  • POST, to update an existing one
  • DELETE, to remove a service

The basic design of the role is to pass it information about the service to be monitored: a parameter describing its state (deleted/present) and a dictionary with all necessary service attributes (name, check_command, etc.). The first task simply checks whether the service exists and registers the result as a variable. Then, if the service state is present, we create or update the service (based on the registered result), or we delete it if the registered GET result shows that it exists. You can check the full implementation in the GitHub gist [2]. If you want to add a specific monitoring service to software deployed by your role, you can simply add the line below to meta/main.yml:

 - { role: icinga2-service, service: { state: present, command: PassiveCheck, name: TestService2 } }

Or apply it in a playbook, like here:

- hosts:
    - usinkok-acc01
  roles:
    - role: icinga2-service
      service:
        name: TestService2
        state: present
        command: PassiveCheck
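Internally, the role boils down to a few uri-module tasks along these lines. This is a sketch with made-up variable names, not the exact content of the gist; the URL scheme (port 5665, host!service object paths) is the standard icinga2 API convention:

```yaml
# Sketch of the role's tasks; variable names are illustrative.
- name: Check whether the service already exists
  uri:
    url: "https://{{ icinga2_host }}:5665/v1/objects/services/{{ inventory_hostname }}!{{ service.name }}"
    method: GET
    user: "{{ icinga2_api_user }}"
    password: "{{ icinga2_api_password }}"
    force_basic_auth: yes
    status_code: [200, 404]
  register: service_check

- name: Create the service when absent
  uri:
    url: "https://{{ icinga2_host }}:5665/v1/objects/services/{{ inventory_hostname }}!{{ service.name }}"
    method: PUT
    user: "{{ icinga2_api_user }}"
    password: "{{ icinga2_api_password }}"
    force_basic_auth: yes
    body_format: json
    body:
      attrs:
        check_command: "{{ service.command }}"
  when: service.state == 'present' and service_check.status == 404

- name: Delete the service when state is deleted
  uri:
    url: "https://{{ icinga2_host }}:5665/v1/objects/services/{{ inventory_hostname }}!{{ service.name }}"
    method: DELETE
    user: "{{ icinga2_api_user }}"
    password: "{{ icinga2_api_password }}"
    force_basic_auth: yes
  when: service.state == 'deleted' and service_check.status == 200
```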

The only missing piece is a check whether the service was actually changed during an update. Unfortunately, icinga2 currently replies to POST with a plain 200 in both cases, which makes it harder to correctly implement changed_when.

I’m checking the possibility of adding this feature to icinga2. If I’m successful on this front, I’ll let you know here :).
Nevertheless, what do you think about this approach? I thought about import_tasks as an alternative, but in that case I’m not sure where I can store sensitive variables like the API user’s password. If you want to share your ideas, feel free to leave a comment.