Notes from the XDMoD patch mitigating the issue with overestimated wall time for suspended jobs.

XDMoD[1] is a fantastic tool that provides various summaries of HPC cluster accounting data. It supports all popular HPC resource managers including Slurm[2], which has been my queuing system of choice for more than 5 years. I have a very good opinion of XDMoD code quality, so the day I saw cluster utilization above 100% for a few days my eyes were on stalks. (You can see the plot of stacked per-project utilization of the cluster below.) Checking the queuing system accounting with the sreport command showed that the nodes were occupied all the time on a specific day, but that should end up at 100%, shouldn’t it?


Digging deeper into the details of the jobs executed during this period I noticed several with a long “Suspended” time. My guess was that this might be the reason XDMoD overestimated cluster utilization. When I asked on the support mailing list I received confirmation that similar issues were observed on clusters with job suspension enabled (thanks to Trey Dockendorf for the prompt replies). Finally, checking the code I understood one very fundamental difficulty – the queuing system doesn’t provide the time slots when the job was actually running, so XDMoD, which plots utilization as a function of time, doesn’t have all the information required to be 100% correct.

I feel that the best approach would be to add another parameter to XDMoD, like a fraction of time, or to change the number of CPUs to a floating-point number and, for suspended or gang-scheduled jobs, recalculate it as numberOfCoresUsed * wallTime / (endTime – startTime), so that a job that actually ran for only part of the window counts as proportionally fewer cores. I didn’t feel this was a change I could implement within a few hours and get merged upstream, but in my case the issue came from only a few short jobs (a few minutes of wall time) that were suspended for more than a day.

The quick fix for me was to add another date validation to the “Shredder” code – if endTime – startTime is larger than the wallTime provided by the queuing system, I falsify endTime by setting it to startTime + wallTime. This approach won’t fix all potential issues, but it mitigates them a lot. The issue will still be visible for gang-scheduled jobs: XDMoD will show all of them running in the same time period (utilization over 100%) and ending earlier than they really did (with correspondingly lower utilization later in that period). However, the clear benefit is that total utilization will be correct. At the time of writing I’m trying to get this merged into the XDMoD github project. We’ll see if it is accepted [3], but if you’re looking for this partial fix just update your code with the patch from the gist below.
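The validation rule described above can be sketched in a few lines. This is only an illustration in Python, not the actual patch; the JobRecord type and its field names are hypothetical stand-ins for whatever the Shredder stores for a job.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical accounting record; all times are in seconds."""
    start_time: int
    end_time: int
    wall_time: int


def clamp_end_time(job: JobRecord) -> JobRecord:
    """Pull end_time back when the elapsed window exceeds the wall time.

    A job that was suspended has end_time - start_time > wall_time, so
    plotting it over the whole window would overestimate utilization.
    """
    if job.end_time - job.start_time > job.wall_time:
        return replace(job, end_time=job.start_time + job.wall_time)
    return job


# A 5 minute job that was suspended for more than a day.
suspended = JobRecord(start_time=1000, end_time=1000 + 90_000, wall_time=300)
fixed = clamp_end_time(suspended)
print(fixed.end_time - fixed.start_time)  # 300

# A job that was never suspended is left untouched.
normal = JobRecord(start_time=0, end_time=600, wall_time=600)
print(clamp_end_time(normal) == normal)  # True
```

The trade-off is the one mentioned above: the job is shown as ending earlier than it really did, but its total contribution to utilization matches the wall time reported by the queuing system.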

Since this changes the “Shredder” stage, it won’t fix data you have already ingested. To achieve this, you’ll have to remove that part of the data and ingest it one more time. This requires manually removing job data from the XDMoD backend databases, which can be done with the help of the script below.



[SOLVED] Singularity 2.6 – fails to resize writable container

Executing singularity image expand -s 1G ./mycontainer.simg failed for me with the following error message:

e2fsck 1.41.12 (17-May-2010)
e2fsck: Superblock invalid, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open ./centos.simg

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 

which suggests that e2fsck is simply called against a file/device that doesn’t have an EXT file system inside. However, I knew that this is a writable image[1], which means that EXT is there. My guess was that the file system doesn’t start at the very beginning of the file, which… happened to be the issue.

Debugging and fixing singularity image expand failure.

I was able to write inside the container, so it was easy to find out how the file system is mounted. Simply run a sleep process inside singularity and check the mounts of the process in the /proc/PID directory. In my case the following commands were helpful:

[root@srv ]# singularity exec ./centos.simg sleep 10m &
[3] 6342
[root@srv ]# ps aux | grep sleep
root      6342  0.0  0.0 100916   608 pts/18   S+   10:22   0:00 sleep 10m
root     15721  0.0  0.0 103252   864 pts/24   S+   06:56   0:00 grep sleep

I’ve added the ps output just to make it obvious that the PID returned by the shell is actually the sleep process running within the singularity container.

For the purpose of the post I’ll show only the | grep loop part of the process mounts, since the full list is quite long. As you’ll see in the listing below, the device is mounted as an ext3 file system in read-only mode, which is the case because I didn’t add the --writable option to my singularity exec.

[root@srv ]# cat /proc/6342/mounts | grep loop
udev /dev/loop1 devtmpfs rw,relatime,size=132263744k,nr_inodes=33065936,mode=755 0 0
/dev/loop2 /local/singularity/mnt/container ext3 ro,nosuid,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
/dev/loop2 /local/singularity/mnt/final ext3 ro,nosuid,relatime,errors=remount-ro,barrier=1,data=ordered 0 0
udev /local/singularity/mnt/final/dev/loop1 devtmpfs rw,relatime,size=132263744k,nr_inodes=33065936,mode=755 0 0

Let’s check how the loop device was created to verify if the issue is really an offset of ext3 formatted space inside our file:

[root@srv ]# losetup -a | grep loop2
/dev/loop2: [001d]:6712556 (), offset 31

Bingo! The offset is 31. Simply creating a loop device manually with this offset and running tools like dd, e2fsck and resize2fs allowed me to resize the container file system. Checking the code I found that in version 2.6 the whole responsibility lies with the shell script called image.expand.exec. I wasn’t sure if the offset is always 31, but in this case you can use the patch below (it’s done against the 2.6 tag).

Thanks to @jmstover [2] I know that those leading bytes are expected in every .simg file, since they are simply a shebang line (30 characters plus a newline, which gives the offset of 31):

[root@usinkok-log01 singularity]# head -c 32 ./centos.simg
#!/usr/bin/env run-singularity

Its goal is to allow simplified execution of applications in the container – for instance by running ./myContainer.simg. Since the offset is fixed, I submitted my patch as a pull request[3].
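To double check where the file system starts in a given image, the length of the shebang line can be computed instead of being hard-coded. The sketch below is only an illustration in Python (the actual fix lives in a shell script); shebang_offset is a hypothetical helper, and the demo uses a temporary file that merely mimics the header of an image.

```python
import os
import tempfile


def shebang_offset(image_path: str, max_header: int = 4096) -> int:
    """Return the offset just past the first line if the file starts with '#!'.

    Returns 0 when there is no shebang, i.e. the file system is expected
    to start at the very beginning of the file.
    """
    with open(image_path, "rb") as handle:
        head = handle.read(max_header)
    if not head.startswith(b"#!"):
        return 0
    newline = head.find(b"\n")
    if newline == -1:
        raise ValueError("no end of shebang line in the first %d bytes" % max_header)
    return newline + 1


# Demo on a fake image: the shebang line followed by some payload bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"#!/usr/bin/env run-singularity\n" + b"\0" * 1024)
    demo_path = tmp.name

offset = shebang_offset(demo_path)
print(offset)  # 31
os.unlink(demo_path)
```

The resulting number is what would be passed as the offset when creating the loop device by hand (losetup has an -o/--offset option for that).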


What one should know developing an API 2 API serverless translating proxy.

As feedback on the last post about the new version of snow-grafana-proxy[1] I got a question about similar functionality implemented as AWS Lambda, a cloud service dedicated to serverless infrastructure. In this model you pay for memory*time of computing and you don’t have to think about the platform. You may think that you have to keep it running because it has to listen for incoming connections, but here the AWS API Gateway service comes into play, allowing you to configure an HTTP listening endpoint that executes an AWS Lambda function per request. The whole concept is depicted in the scheme (Grafana snow integration scheme).

How to configure AWS API gateway with Lambda functions working as a backend?

This question has been answered a number of times, so instead of repeating it I’ll just redirect you to the AWS docs and a blog post I read [2,3]. If you are interested in configuration done from awscli, you can find the appropriate commands in the README file in the subproject directory [4].

I’d like to focus on a few hints for those who would like to create similar service, so the topic is:

What one should know developing an API 2 API serverless translating proxy.

1) Prepare your own local test cases.
Although it’s possible to test everything within the AWS Lambda free tier, I think it’s not an efficient test procedure. For some reason I’m a vim user and I really don’t use any sophisticated IDE – maybe this is why editing a script in the AWS web interface was not comfortable enough for me. The easiest way was to add a few last lines:
with those few lines I was able to test proxy -> interface without the need to repeatedly update the lambda function code. Each update requires zipping the files and uploading a new version; both are quick operations, but if you’re doing it tens or hundreds of times even an additional 5 seconds matters.
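As an illustration of what such "last lines" can look like, here is a minimal sketch. It is not the code from the repository: lambda_handler below is a hypothetical handler that only echoes the request, and the event shape (a JSON string under "body", a response with statusCode and body) is an assumption based on the API Gateway proxy integration.

```python
import json


def lambda_handler(event, context):
    """Hypothetical handler: echo the JSON request body back to the caller."""
    body = json.loads(event.get("body") or "{}")
    return {"statusCode": 200, "body": json.dumps({"received": body})}


def run_local(event):
    """Call the handler the way the Lambda runtime would, without a context."""
    return lambda_handler(event, None)


# The few last lines: executed only when the file is started directly,
# so the same file can still be zipped and uploaded unchanged.
if __name__ == "__main__":
    sample_event = {"body": json.dumps({"target": "incidents"})}
    print(run_local(sample_event))
```

Running python on the file exercises the handler locally, and the block at the end is simply ignored when the module is imported by the Lambda runtime.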

2) Before you start check if everything can be serverless.
In my case it’s really true – one can use grafana as a service, and ServiceNow is also hosted for users without the need to have your own servers, so snow-grafana-proxy based on SaaS makes perfect sense. If you need to maintain the platform for one of those sides, a lightweight translating proxy can and should be deployed there.

3) Check if backend API replies promptly.
I was testing this on an instance from the ServiceNow developers program, but I also know from our company reality that it may take some time to get an answer from ServiceNow. Of course, you can adjust the AWS Lambda function timeout; however, you pay for execution wall time. It makes no difference whether your function was waiting or doing a real job, so from a cost perspective my tests on the developer instance, which showed the need to increase the timeout to ~30s, were not very promising.

4) Serverless means no daemon.
This may simplify development of the “translator”, but will also limit your possibilities. In my case I had to remove a lot of in-memory caching, since it didn’t make sense. A subsequent call may start the process from scratch – you cannot rely on it “remembering” the result we got from the backend API 5s ago. Of course you can use another cloud service to store this. Interested in this? Create a pull request or open an issue with suggestions 🙂 You can also use the caching capabilities of AWS API Gateway, but… in the case of API 2 API it’s highly probable that configuring this won’t be a piece of cake – for instance parts of the request may not be relevant, which means that you may get a lower hit rate than possible.

5) Hard-coding may not be a bad idea.
Simply hardcoding configuration into scripts will reduce the number of dependencies. This is important because if you need additional modules (not available in AWS Lambda by default) you have to create a deployment package [5]. It’s not difficult, especially if you’re developing your functions in a virtualenv, but it will have an impact on price, since it will increase memory requirements (important if you’re going above the minimum of 128MB and you’re not waiting 20s for the backend reply – as I was :).

Finally, special thanks go to @JanGarlaj, who opened the feature request and gave me a few hints on what the implementation should look like.



Integration for ServiceNow table API and grafana.

Some time ago I wrote a blog post about my approach to ServiceNow and grafana integration; you can find it under this link[1]. The key concept used there is presented in the diagram below (Grafana snow integration scheme). Besides the technical aspects of the integration, the operational results were very good and reduced the time incidents spent in my “queue” – simply by giving an overview of what is assigned to whom and what the status of the tickets is. However, due to the lack of flexibility in the 1st version of snow-grafana-proxy it was difficult to reuse it in other places: the attributes returned to grafana, the lookup methods and the table were hard-coded. I decided to rewrite the service and here we are.

New version of snow-grafana-proxy available!

You can find the new release on the project’s github page[2]. I’d say it’s in beta phase – there are no known issues. However, if you encounter any difficulty just let me know by opening an issue on the project’s github page. (I’ve been testing it on a Kingston developer instance.)

New configuration file

The configuration file format changed from ini to YAML, which allows a much more structured configuration. In the current state it’s possible to configure multiple queries against any service-now table with arbitrary filters. Each value has a configurable “interpreter”; at the time of publication the available ones are:

  • none – Simply return value of the attribute specified as “name” argument.
  • map – Use the “map” dictionary defined for the attribute and send the corresponding value, assuming that the value from service-now is a key in the “map” dictionary.
  • object_attr_by_link – Assumes that for this attribute the service-now API returns a value/link pair. In this case an additional HTTP request may be needed to get the information available under the link. This interpreter requires additional parameters specified in the interpreterParams dictionary, for instance: interpreterParams: { linkAttribute: "name", default: "FailedToGetName"} will send to grafana the value of the name attribute available under the link from the previous get request. In case of failure the interpreter will return the value “FailedToGetName”. The default value is important since sometimes the value is really undefined – like the description of the assignment group for an unassigned incident. Those values are cached until snow-grafana-proxy restarts, which greatly reduces the number of REST calls to service-now.
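The three interpreters can be pictured with a short sketch. It is not the code from the repository: the interpret function, the fetch_link callback and the behaviour for a value missing from “map” (returning the raw value) are assumptions made only for this illustration.

```python
def interpret(raw, attr, fetch_link=None, link_cache=None):
    """Translate a raw service-now value according to attr['interpreter'].

    fetch_link is a hypothetical callback doing the extra HTTP request and
    returning a dictionary; link_cache keeps results between calls.
    """
    kind = attr.get("interpreter", "none")
    if kind == "none":
        return raw
    if kind == "map":
        # Assumption: values missing from the map are passed through as-is.
        return attr["map"].get(raw, raw)
    if kind == "object_attr_by_link":
        params = attr["interpreterParams"]
        default = params["default"]
        if not isinstance(raw, dict) or not raw.get("link"):
            return default  # e.g. unassigned incident, nothing to follow
        cache = link_cache if link_cache is not None else {}
        link = raw["link"]
        if link not in cache:
            try:
                cache[link] = fetch_link(link)
            except Exception:
                return default
        return cache[link].get(params["linkAttribute"], default)
    raise ValueError("unknown interpreter: %s" % kind)


# Demo with a fake fetcher that records how often it is called.
calls = []

def fake_fetch(link):
    calls.append(link)
    return {"name": "Network team"}

group_attr = {
    "interpreter": "object_attr_by_link",
    "interpreterParams": {"linkAttribute": "name", "default": "FailedToGetName"},
}
value = {"value": "abc", "link": "https://example.invalid/group/abc"}
cache = {}
first = interpret(value, group_attr, fake_fetch, cache)
second = interpret(value, group_attr, fake_fetch, cache)
print(first, second, len(calls))  # Network team Network team 1
```

The second lookup is served from the cache, which is the effect described above: one extra REST call per distinct link instead of one per row.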

An example configuration file is available in the repository, let me quote it here:

As you see there are a few additional parameters I forgot to explain:

  • cacheTime – the number of seconds for which query replies are cached, so if we get the information once and someone runs the same query a few seconds later (another user with the grafana dashboard open, or even the same one with a very short auto refresh time) the proxy will reply immediately with the same information.
  • snowFilter – the value passed to the ServiceNow API as request parameters. You can check the available parameters in the snow documentation[3].
  • table – quite self-explanatory, the name of the table you’d like to get information from, like incident, sc_task, change.
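The effect of cacheTime can be illustrated with a tiny time-based cache. This is a sketch only, not the proxy's implementation; QueryCache and its method names are hypothetical, and the clock is injectable just to make the demo deterministic.

```python
import time


class QueryCache:
    """Keep query replies for cache_time seconds."""

    def __init__(self, cache_time, clock=time.monotonic):
        self.cache_time = cache_time
        self.clock = clock
        self._store = {}  # key -> (timestamp, reply)

    def get_or_fetch(self, key, fetch):
        """Return a fresh cached reply, or call fetch() and remember it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.cache_time:
            return hit[1]
        reply = fetch()
        self._store[key] = (now, reply)
        return reply


# Demo with a fake clock and a counter instead of a real REST call.
fake_now = [0.0]
backend_calls = []

def query_backend():
    backend_calls.append(1)
    return ["INC0001", "INC0002"]

cache = QueryCache(cache_time=30, clock=lambda: fake_now[0])
cache.get_or_fetch("incident", query_backend)   # goes to the backend
fake_now[0] = 5.0
cache.get_or_fetch("incident", query_backend)   # served from the cache
fake_now[0] = 40.0
cache.get_or_fetch("incident", query_backend)   # expired, backend again
print(len(backend_calls))  # 2
```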

Results of the example configuration file may look like the images below.

New command line options

In addition to the changed configuration file, this version supports two command line options, explained in the command’s help:

 usage: [-h] [-c CONFIGFILE] [-d]

Simple JSON data source for grafana retreving data from service-now

optional arguments:
  -h, --help            show this help message and exit
                        Configuration file to use
  -d, --debug           Start in foreground and print all logs to stdout

Yes! By default snow-grafana-proxy will daemonize itself; if you’d like to run it in the foreground for debugging purposes just use the -d option as explained above.

Feedback appreciated!


[VLOG] How to configure sssd idmap parameters to get minimal collision probability?

This post covers the quite complicated topic of sssd configuration parameters for SID to UID/GID mapping. I promised to share this with you a few weeks ago. The key aspect here is to understand the principles of the mapping algorithm implemented in sssd, which I described in a previous post and vlog; however, the consequences may not be so obvious.

To help myself configure parameters like ldap_idmap_range_max, ldap_idmap_range_min or ldap_idmap_range_size, I developed a mapping simulator – you can find the code on github. At the time of publication all parameters are hardcoded, but it’s very easy to adjust the script to your needs.
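To get a feeling for how these three parameters influence collisions, here is a rough back-of-the-envelope sketch. It is not the simulator from the repository: it assumes that each domain is hashed uniformly into one of the available slices and ignores any collision handling sssd may do, so it is just the classic birthday problem. The parameter values are illustrative (I believe they match the sssd defaults, but verify them in the sssd-ldap man page).

```python
def slice_count(range_min: int, range_max: int, range_size: int) -> int:
    """Number of ID slices that fit between range_min and range_max."""
    return (range_max - range_min) // range_size


def collision_probability(domains: int, slices: int) -> float:
    """Probability that at least two domains land in the same slice.

    Assumes a uniform, independent hash per domain (birthday problem).
    """
    if domains > slices:
        return 1.0
    p_unique = 1.0
    for taken in range(domains):
        p_unique *= (slices - taken) / slices
    return 1.0 - p_unique


slices = slice_count(range_min=200000, range_max=2000200000, range_size=200000)
print(slices)  # 10000

for domains in (2, 10, 50):
    print(domains, collision_probability(domains, slices))
```

Under these assumptions a larger ldap_idmap_range_size means fewer slices and therefore a higher collision probability for the same number of domains, while a smaller one limits how many RIDs of a single domain fit into its slice.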

Interested in some details? Watch the video.

Ansible role used as plugin for simple REST API services.

For quite a long time we’ve been managing icinga2 services in our infrastructure based on a set of lists defined in ansible variables. The advantage of this approach was a nice icinga2 configuration with one file per host gathering all its services. However, the direct consequence was the need to manage all monitoring endpoints within one ansible task – simply a template with the host definition and all “Apply” statements. At some point this approach started to be annoying, because all the icinga definitions were in one role, separate from the deployment of software and monitoring plugins. The second issue was variable files reaching a thousand lines (reading such a long yaml is not a pleasure). One day I thought – it’s time to switch to creating services over the API, there must be an ansible module that will help us.

How to dynamically create icinga2 services from ansible?

Surprisingly, when you check the ansible website[1] you’ll notice that at the time of publication there are only two icinga2 modules in the main distribution – icinga2_feature and icinga2_host. I had some experience with the icinga2 API, so my first thought was to develop an icinga2_service module or join a project working on it. However, after a few minutes I realized that it’s really only about a few REST calls, which can be implemented as an ansible role applied to other roles as a dependency.

Icinga2 API has basically 4 calls for service objects, each realized by a different HTTP method. Those calls are really self-explanatory:

  • GET, to check if the service exists and read its attributes
  • PUT, to add a new service
  • POST, to update an existing one
  • DELETE, to remove an existing service

The basic design of the role is to pass it information about the service to be monitored: a parameter describing its state (deleted/present) and a dictionary with all necessary service attributes (name, check_command, etc.). The first task just checks if the service exists and registers the result as a variable. Then, if the service state is present, we create or update the service (based on the registered result); if the state is deleted, we delete the service when the registered result of GET shows that it exists. You can check the full implementation in a github gist [2]. If you want to add a specific monitoring service to software deployed in your role, you can simply add the line below to meta/main.yml

 - { role: icinga2-service, service: { state: present, command: PassiveCheck, name: TestService2 } }

Or apply it in a playbook, like here:

- hosts:
  - usinkok-acc01
  roles:
  - role: icinga2-service
    service:
      name: TestService2
      state: present
      command: PassiveCheck
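The create/update/delete decision made by the role after the initial GET can be summarized as a small function. This is only a sketch of the logic in Python, not the role itself (which consists of ansible tasks); plan_request is a hypothetical name.

```python
def plan_request(desired_state: str, exists: bool):
    """Return the HTTP method to send to the icinga2 API, or None for no-op.

    exists is the outcome of the initial GET for the service object.
    """
    if desired_state == "present":
        # Create the object when it is missing, otherwise update it.
        return "POST" if exists else "PUT"
    if desired_state == "deleted":
        # Nothing to remove when the GET did not find the object.
        return "DELETE" if exists else None
    raise ValueError("state must be 'present' or 'deleted'")


for state in ("present", "deleted"):
    for exists in (False, True):
        print(state, exists, plan_request(state, exists))
```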

The only missing piece is a check whether the service was changed during the update. Unfortunately, the current icinga2 reply to POST is simply 200 in both cases, which makes it more difficult to correctly implement changed_when.

I’m checking the possibility of adding this feature to icinga2. If I’m successful in this field I’ll let you know here :).
Nevertheless, what do you think about this approach? I thought about import_tasks as an alternative, but in this case I’m not sure where I can store sensitive variables like the API user’s password. If you want to share your ideas feel free to leave a comment.


[VLOG] How sssd maps SID to UID/GID

A few months ago I wrote a blog post analysing idmap – an internal library of sssd. I received quite positive feedback about it, but also a few requests for something more high level, explaining the basics without direct reference to sssd code.
Here we are – 1st video on funinit.
PS. The date in lower right corner is wrong and really not needed – forgive me, it’s 1st vpost 🙂

You may also be interested in the next post about sssd algorithmic mapping: How to configure sssd idmap parameters to get minimal collision probability

How to use job_submit_lua plugin with Slurm?

Lua is a scripting language implemented as a C library, which makes it a perfect choice for small plugins to bigger C applications. It’s both efficient (since it’s really a C library, the lua file can be read once while its functions are executed multiple times) and easy to modify.

In the case of a queuing system like Slurm[1] there is always a need for customization that fulfills organization-specific requirements – like some preliminary checks done on job submission. For this the Slurm framework offers so-called job_submit plugins, which are normally shared libraries (.so) implementing two required functions, job_submit and job_modify, and two optional ones, init and fini. While compiling .c into a shared library is not a big deal, there are situations where one would rather use a script; for those Slurm provides the job_submit_lua plugin, which simply calls a lua script to do the real work. In this post I’ll cover the issues I met configuring it.

How to use job_submit_lua plugin in Slurm?

Compile Slurm with

When you run ./configure it has to discover the lua libraries to link against them. In my case (CentOS 6), when I just installed lua-devel (yum -y install lua-devel) and executed ./configure I noticed

configure: WARNING: unable to locate lua package

and in config.log

[...]Package lua was not found in the pkg-config search path.                                                                                                                
Perhaps you should add the directory containing `lua.pc'                                                                                                                
to the PKG_CONFIG_PATH environment variable                                                                                                                             
No package 'lua' found                                                                                                                                                  
configure:24443: $? = 1                                                                                                                                                 
configure:24457: result: no                                                                                                                                             
No package 'lua' found                                                                                                                                                  
configure:24533: WARNING: unable to locate lua package

Simple check with manual execution of pkg-config, confirmed that it doesn’t exist

[root@hpc-slurmtest slurm-17.11.7]# pkg-config --exists --print-errors "lua-5.1"                                                                                        
Package lua-5.1 was not found in the pkg-config search path.                                                                                                            
Perhaps you should add the directory containing `lua-5.1.pc'                                                                                                            
to the PKG_CONFIG_PATH environment variable                                                                                                                             
No package 'lua-5.1' found

OK.. maybe CentOS 6 is too old and it has older version of lua, check:

[root@hpc-slurmtest slurm-17.11.7]# yum info lua-devel     | grep '^Version'                                                                                                             
Version     : 5.1.4

So maybe it doesn’t provide a .pc (package config file)

[root@hpc-slurmtest slurm-17.11.7]# rpm -ql lua-devel | grep '.pc'

The file is there, but its name is not lua-5.1 or lua5.1, it’s just lua. Let’s check if creating a symlink will help the .m4 macro to find it:

ln -s  /usr/lib64/pkgconfig/lua.pc /usr/lib64/pkgconfig/lua-5.1.pc
checking for lua... yes
checking for whether we can link to liblua... yes lua

I tried to check how the macro is implemented; you can find it in auxdir/x_ac_lua.m4. I rewrote it to use PKG_CHECK_MODULES in a more standard way and asked SchedMD for their opinion[2]. Additional research revealed that the lua development team doesn’t provide pkg-config files[3]. They are created by distribution packagers, so they may differ between operating systems, and the dirty implementation in the Slurm autotools macro may be just fixing incompatibility issues.
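Since the .pc file name depends on the distribution, it may help to check quickly which name is available before running ./configure. The sketch below is only an illustration in Python; the list of candidate names and their order are my assumption, and the exists parameter can be swapped out so the lookup can be tried without pkg-config installed.

```python
import subprocess


def pkg_config_exists(name: str) -> bool:
    """True when 'pkg-config --exists NAME' succeeds."""
    try:
        result = subprocess.run(
            ["pkg-config", "--exists", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # pkg-config itself is not installed
    return result.returncode == 0


def first_available(candidates, exists=pkg_config_exists):
    """Return the first package name that pkg-config knows, or None."""
    for name in candidates:
        if exists(name):
            return name
    return None


# Names under which packagers may ship the lua 5.1 .pc file (assumed list).
LUA_CANDIDATES = ["lua5.1", "lua-5.1", "lua"]
print(first_available(LUA_CANDIDATES))
```

On the CentOS 6 machine described above this would be expected to print lua, which is the name the macro was not looking for.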

Remember that if you play with macros used by autotools you have to recreate configure. Slurm follows the quite standard way of doing it with make configure, but it looks like the Makefile dependencies are not implemented correctly. I had to remove configure manually, because make didn't notice the changes in x_ac_lua.

Where should I put my .lua script?

I got it compiled, so I could switch to the implementation of my .lua script, but where should I create it? Checking the slurm.conf manual, in the section about JobSubmitPlugins you’ll find an explanation that was quite enigmatic, at least for me when I read it:

For examples of use, see the Slurm code in "src/plugins/job_submit" and "contribs/lua/job_submit*.lua" then modify the code to satisfy your needs. Slurm can be configured to use multiple job_submit plugins if desired, however the lua plugin will only execute one lua script named "job_submit.lua" located in the default script directory (typically the subdirectory "etc" of the installation directory). No job submission plugins are used by default.

When I read this for the first time I was unable to understand where I should put my job_submit.lua, but a quick check of src/plugins/job_submit/lua/job_submit_lua.c showed that the location comes from DEFAULT_SCRIPT_DIR, which is

[root@hpc-slurmtest slurm-17.11.7]#  grep -ri  DEFAULT_SCRIPT_DIR ./src/plugins/job_submit/lua/Makefile
./src/plugins/job_submit/lua/Makefile:AM_CPPFLAGS = -DDEFAULT_SCRIPT_DIR=\"$(sysconfdir)\" \

Simply ${prefix}/etc, in other words next to slurm.conf, and I agree the documentation stated it clearly 🙂

Develop the plugin

My goal was to develop a plugin that will prevent submission of jobs without explicit account specification (we found out that users don’t adjust account specification simply relying on the default, which caused some internal billing issues). I simply copied the example, changed the job_submit function to very simple:

function slurm_job_submit(job_desc, part_list, submit_uid)
        -- Reject jobs submitted without an explicit account
        if job_desc.account == nil then
                slurm.log_user("You have to specify account. Usage of default accounts is forbidden.")
                return slurm.ESLURM_INVALID_ACCOUNT
        end
        return slurm.SUCCESS
end

I added the JobSubmitPlugins=lua line to slurm.conf, restarted slurmctld and attempted to test it with srun --pty bash -l. Surprisingly, the error message from slurm.log_user() was displayed but the job started, which means that my return statement didn’t really return an error. I used ESLURM_INVALID_ACCOUNT since it’s one of the error values defined in slurm/slurm_error.h and was the most suitable for me. However, how is it passed to lua?

Digging into the Slurm code one more time, you'll see that error codes are passed to the lua library in the job_submit plugin init function, and not all error codes are there. Now the sentence I like: "It’s open source so just add what you need". One more time I created a patch and sent it to the main development team, to check if it's fine and to be able to use my script after a Slurm upgrade. It was accepted in a few hours[4] and the job_submit script worked as expected:

[root@hpc-slurmtest slurm]# srun --pty bash -l
srun: error: You have to specify account. Usage of default accounts is forbidden.
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified


Recovering orphaned vmware VMs from Power Shell

As a result of a vmware cluster outage, I ran into a situation where some VMs were “orphaned”. The way to recover them is to remove the VM object (if one exists) from the inventory and then register a new VM from the vmx file we have on the data store. If you have a substantial number of VMs in this state, the only reasonable way to do that is to use a PowerShell script. Searching the web you can find more than a few long and complicated scripts designed to deal with this situation. However, I’m probably not the only one sceptical about using scripts from the web on a production environment in a critical state without fully checking what the script does. In the end, this takes more time than developing your own script. So instead of sharing a long .ps1 script I’ll try to gather short snippets of code addressing bits of the “exercise”.
Options -RunAsync and -Confirm:$false are obviously optional, but if you’re trying to automate the process both are helpful.


How to find .vmx file for my orphaned VM?

asnp *vmware*
Connect-VIServer -Server myVcenterServerAddress
$DS = Get-Datastore -Name $myDS
# Assumption: the truncated argument was the Id of the datastore object
$DSView = Get-View -Id $DS.Id
$DSBrowser = Get-View $DSView.Browser
$SearchSpecObj = New-Object VMware.Vim.HostDatastoreBrowserSearchSpec
$searchRes = $DSBrowser.SearchDatastoreSubFolders(( "[" + $myDS + "]" ), $SearchSpecObj)

# Print the .vmx file whose name matches the orphaned VM
foreach ( $dir in $searchRes ) {
    $VMXFile = ($dir.file | where { $_.Path -like "*.vmx" })
    if ( $VMXFile.Path -like $myVM ) {
        Write-Host $VMXFile.Path
    }
}

SearchDatastoreSubFolders may take a while to execute, but if you’re running the script repeatedly from PowerShell ISE you can comment it out after the first run, since $searchRes stays in the session.

How to register a vm from .vmx file?

New-VM -VMFilePath $VMXPath -VMHost $vmHost -RunAsync

Remember that this will not start the VM; you need to execute Start-VM afterwards.

What is the valid path for the -VMFilePath option? (“[…] is not valid path to a virtual machine”)

The correct format for this option is: '[yourDataStoreName] directory/yourvm.vmx', if you are in the foreach loop from the 1st listing, you can build it like here:

  foreach ( $dir in $searchRes ) {
    $VMFolder = (($dir.FolderPath.Split("]").trimstart())[1]).trimend('/')
    $VMXFile = ($dir.file | where { $_.Path -like "*.vmx" })
    $VMXPath = ("[" + $myDS + "] " + $VMFolder + "/" + $VMXFile.Path)
  }

Notice the space after “]”? It’s required.

How to find VMHost (ESX) appropriate for my VM?

If you have an orphaned inventory entry you can still get this as:

$vmHost = (Get-VM -Name $myVM).VMHost

How to remove orphan entry from the inventory?

(“The specified key, name, or identifier already exists”)
Remember that you should get the VMHost from this entry before you remove it.

Remove-VM -VM VM-NAME -Confirm:$false

Why my ansible-playbook hangs in “Gathering facts”?

Did it ever happen to you that ansible hung in “Gathering facts”? It happened to me today. Discovering the root cause was important for me, but the debugging method may be quite similar for other issues.

How to debug ansible hanging in “Gathering facts”?

First thing you should do is to execute ansible with -vvv options to increase verbosity. In my case blue font, which is the default for verbose messages, is not very visible on the screen, so I’m adding ANSIBLE_NOCOLOR=yes as environment variable. Finally, the command looks like:

# ANSIBLE_NOCOLOR=yes ansible-playbook -vvv -i ./hosts -l host01 -DC -t rsyslog -DC ./common.yml
 SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/user/.ansible/cp/02cc38bba9 -tt '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ltxvshvezrnmumzdprccoiekhjheuwxt; /usr/bin/python /home/user/.ansible/tmp/ansible-tmp-1527099315.31-224479822965785/'"'"'"'"'"'"'"'"' && sleep 0'"'"''

The execution hung here, so we know that ssh to the host worked, and we have the command that was executed. As we could guess, it runs the setup module, because gathering facts is done by setup. The file is probably still on the host, since the execution hung and ansible didn’t clean up the files in the .ansible directory. If in your case it is a failure rather than a hang and you cannot find those files, just add ANSIBLE_KEEP_REMOTE_FILES=yes to your ansible-playbook execution, like here:

# ANSIBLE_KEEP_REMOTE_FILES=yes ansible-playbook -vvv -i ./hosts -l host01 -DC -t rsyslog -DC ./common.yml

Then log in to host01 as the user used to run ansible tasks on the host, and simply execute:

strace -f /bin/sh -c "echo BECOME-SUCCESS-ltxvshvezrnmumzdprccoiekhjheuwxt; /usr/bin/python /home/user/.ansible/tmp/ansible-tmp-1527099315.31-224479822965785/ && sleep 0"

In my case it hung on the statfs system call against one of the NFS file systems. BUM! Issue found, and easily verified with df /on/appropriate/path, which also hung.
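The same check can be scripted, so that a hanging file system is reported instead of blocking your terminal. The sketch below is just one possible approach in Python: it runs the statvfs call in a child process and gives up after a timeout. mount_status is a hypothetical helper, and killing the child is best effort, since a process stuck in the kernel may linger.

```python
import subprocess
import sys


def mount_status(path, timeout=5.0):
    """Return 'ok', 'error' or 'hang' for a statvfs() call on path.

    The call runs in a child process, so a hanging file system blocks
    the child and not this script.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        child.kill()  # best effort, do not wait for it
        return "hang"
    return "ok" if returncode == 0 else "error"


status = mount_status("/")
print("/", status)
```

Calling it for every mount point listed in /proc/mounts would point at the file system that makes fact gathering hang.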

Another approach that may be helpful in different cases, especially when hunting bugs, is to execute the module under a Python debugger like pdb. This approach is well covered in a post on Will Thames’ tech blog.