[01:08:18] 10Tool-Labs-tools-Pageviews: Do not show log scale when all values are below 10 - https://phabricator.wikimedia.org/T140910#2482656 (10MusikAnimal) 05Open>03Resolved a:03MusikAnimal Fixed with https://github.com/MusikAnimal/pageviews/commit/d6221bdb37a1168f2344f8ada75657fd7f345417
[01:08:49] 10Tool-Labs-tools-Pageviews: Make "stepSize" of Y-axis no smaller than one, so only show integers - https://phabricator.wikimedia.org/T140784#2482659 (10MusikAnimal) 05Open>03Resolved a:03MusikAnimal Fixed with https://github.com/MusikAnimal/pageviews/commit/d6221bdb37a1168f2344f8ada75657fd7f345417
[01:10:36] 10Tool-Labs-tools-Pageviews: Add URL parameters to show/hide log scale and start Y-axis from zero - https://phabricator.wikimedia.org/T140783#2482662 (10MusikAnimal) 05Open>03Resolved The option actually just disables "auto-log detection", so the option to show the log scale is there, it just won't be shown...
[04:02:06] 06Labs, 10Labs-Infrastructure, 06Discovery, 06Maps, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2459518 (10Dzahn) just some technical notes: osmdb.eqiad.wmnet is an alias for labsdb1006.eqiad.wmnet cheat sheet for shp2pgsql htt...
[04:32:04] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Gammawave was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=771760 edit summary:
[04:35:32] 10Labs-project-wikistats: wikistats: add tcy.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2482793 (10Dzahn)
[04:36:18] 10Labs-project-wikistats: wikistats: add tcy.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2482806 (10Dzahn)
[05:41:04] how to connect tools-login.wmflabs.org via ftp (cannot connect to server)
[05:41:39] i used pagent
[05:41:55] pageant
[05:46:01] ???
[05:53:06] ???
[06:33:42] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (10greg) rOPUP:modules/toollabs/manifests/dev_environ.pp already has differences for what is installed and not just version, but software themselves...
[08:40:24] 06Labs, 10Beta-Cluster-Infrastructure, 07Blocked-on-Operations, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2482979 (10hashar) I haven't seen that occurr...
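A note on the unanswered FTP question above: the Tool Labs bastions speak SSH only, so a plain-FTP client cannot connect; file transfer goes over SFTP or SCP using the same SSH key that Pageant holds. A minimal sketch, assuming the key is already loaded in Pageant or ssh-agent and using "yourshellname" as a placeholder account name:

```bash
# SFTP rides on SSH (port 22); there is no separate FTP service on the bastion.
sftp yourshellname@tools-login.wmflabs.org

# Or copy a single file non-interactively with scp:
scp mytool.py yourshellname@tools-login.wmflabs.org:~/
```

On Windows, a GUI client such as WinSCP or FileZilla works the same way, provided it is set to SFTP mode rather than FTP.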
[08:48:11] 06Labs, 10Labs-Infrastructure, 10Shinken: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483016 (10hashar)
[08:48:21] 06Labs, 10Labs-Infrastructure, 10Shinken: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483030 (10hashar) p:05Triage>03High
[08:48:59] 06Labs, 07Graphite, 13Patch-For-Review: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#1850400 (10hashar) Yesterday spring most probably has broken the labs Shinken that got a 401 trying to reach labmon1001: T140976
[08:59:05] (03PS1) 10Lokal Profil: [Don't merge] Add Albania and Kosovo to the database [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/300233 (https://phabricator.wikimedia.org/T140488)
[09:04:19] 06Labs, 10Labs-Infrastructure, 06Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2483051 (10fgiunchedi) serpens still shows some memory growth, possibly not fixed yet {F4293977}
[10:45:18] (03CR) 10Lokal Profil: [C: 031] "Looks fine." [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299891 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[11:10:54] (03CR) 10Jean-Frédéric: [C: 032] "No, I was just assuming you would have been added automatically. :) Thanks for the review!" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299891 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[11:13:02] (03Merged) 10jenkins-bot: Harvest Wikidata item for Canada in English ca_(en) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299891 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[11:59:06] 06Labs, 10Beta-Cluster-Infrastructure, 07Blocked-on-Operations, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10AlexMonk-WMF) I saw it just a few...
[12:55:59] 06Labs, 10Tool-Labs, 10MediaWiki-Interwiki, 10MediaWiki-extensions-Interwiki, 10Wikimedia-Interwiki-links: Toollabs interwiki doesn't work when url parameters needed - https://phabricator.wikimedia.org/T140981#2483286 (10Dvorapa)
[13:22:48] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483339 (10yuvipanda) a:03yuvipanda
[14:16:43] wikimedia/nagf#49 (master - 3bcf213: Yuvi Panda) The build passed. - https://travis-ci.org/wikimedia/nagf/builds/146388742
[14:17:27] Krinkle ^ jfyi (I merged it directly) to update graphite URL (old one continues to redirect appropriately)
[14:31:58] 06Labs, 10Labs-Infrastructure: Recapture unused floating IPs - https://phabricator.wikimedia.org/T140985#2483421 (10Andrew)
[14:32:19] 06Labs, 10Labs-Infrastructure: Recapture unused floating IPs - https://phabricator.wikimedia.org/T140985#2483421 (10yuvipanda) The tools project uses 50 of these, need to revisit and kill.
[14:33:57] 06Labs, 06Operations, 13Patch-For-Review: Move labs graphite to graphite-labs.wikimedia.org - https://phabricator.wikimedia.org/T140899#2483436 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done and left redirects in place!
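On the graphite move just resolved above ("left redirects in place"): one quick way to confirm a legacy hostname still redirects is to inspect the response headers. A sketch, assuming the previous name was graphite.wmflabs.org (the old URL is not spelled out in the log):

```bash
# Expect a 301/302 status and a Location header pointing at the new host.
curl -sI https://graphite.wmflabs.org/ | grep -iE '^(HTTP|Location)'
```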
[14:35:55] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483439 (10yuvipanda) 05Open>03Resolved Thanks for reporting :D is sorted out now.
[14:36:37] 06Labs, 07Graphite, 13Patch-For-Review: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#2483441 (10yuvipanda) Is setup to allow login to https://grafana-labs-admin.wikimedia.org to anyone who is a member of any labs project. Still can't see the dashboards I save there in ht...
[14:44:00] 06Labs, 10Tool-Labs: Webservice on Tools Labs fails repeatedly - https://phabricator.wikimedia.org/T115231#2483454 (10russblau) It is currently down again. Shell shows the following: ``` tools.dplbot@tools-bastion-03:~$ kubectl get pod NAME READY STATUS RESTARTS AGE dplbot-14457...
[14:45:09] 06Labs, 10Tool-Labs: Webservice on Tools Labs fails repeatedly - https://phabricator.wikimedia.org/T115231#2483468 (10yuvipanda) Don't, am looking at it just now.
[14:47:06] (03CR) 10Lokal Profil: [C: 032] Add two known fields to fr (fr) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299886 (owner: 10Jean-Frédéric)
[14:47:52] (03Merged) 10jenkins-bot: Add two known fields to fr (fr) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299886 (owner: 10Jean-Frédéric)
[14:48:33] 06Labs, 10Tool-Labs: Webservice on Tools Labs fails repeatedly - https://phabricator.wikimedia.org/T115231#2483472 (10yuvipanda) Hmm, I fixed it (required a restart of kube2proxy layer). I'll file a separate bug to investigate this just now.
[14:50:55] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Monitor kube2proxy failures - https://phabricator.wikimedia.org/T140988#2483474 (10yuvipanda)
[15:06:03] (03CR) 10Lokal Profil: [C: 032] Harvest Wikidata item in gb-eng (en) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299890 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[15:06:55] (03Merged) 10jenkins-bot: Harvest Wikidata item in gb-eng (en) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299890 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[15:23:51] 10Wikibugs, 13Patch-For-Review: Do not notify #Trash tasks to IRC - https://phabricator.wikimedia.org/T140426#2464291 (10Luke081515) Alternatively, we can [[https://phabricator.wikimedia.org/project/manage/89/#24407 change the name back]], before the rename it worked.
[15:27:10] 10Wikibugs, 13Patch-For-Review: Do not notify #Trash tasks to IRC - https://phabricator.wikimedia.org/T140426#2464291 (10Dzahn) We just gotta watch out we are not creating more meta notifications about not creating notifications than by just ..not touching all the trash tasks.
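Back on the dplbot webservice outage earlier in this hour: the actual fix was an admin-side kube2proxy restart, but the first checks a tool maintainer runs look roughly like the sketch below. It uses the standard Tool Labs helpers (`become`, `webservice`); the pod name is illustrative, matching the truncated `dplbot-14457...` in the task comment:

```bash
become dplbot                      # switch from your own account to the tool account
kubectl get pod                    # list this tool's Kubernetes pods; STATUS should be Running
kubectl logs dplbot-14457-xxxxx    # illustrative pod name; inspect recent output for errors
webservice restart                 # bounce the webservice if the pod itself is wedged
```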
[15:51:08] 06Labs, 10Horizon, 13Patch-For-Review: Investigate (and probably disable) 'rebuild instance' option - https://phabricator.wikimedia.org/T140259#2483686 (10Andrew) 05Open>03Resolved
[16:12:53] 06Labs, 10Tool-Labs: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2483810 (10yuvipanda)
[16:13:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[16:14:20] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2483827 (10bd808) Gerrit repos for `labs/striker` and `labs/striker/wheels` [[https://www.mediawiki.org/w/index.php?title=Git%2FNew_repositories%2FRequests%2FEntri...
[16:19:58] YuviPanda: okay, thanks
[16:40:10] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2483939 (10bd808)
[16:40:23] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, and 2 others: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2483940 (10bd808)
[16:40:34] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Security-Reviews, 10Striker: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2483942 (10bd808)
[16:41:12] andrewbogott: there?
[16:41:19] what's up?
[16:42:05] 06Labs, 10Labs-Infrastructure, 10DBA, 10Striker: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832#2483968 (10bd808)
[16:42:07] zhuyifei1999_: ?
[16:43:01] I haven't monitored the tool closely this week, but the nagf metrics for encoding02 look weird
[16:43:12] http://tools.wmflabs.org/nagf/?project=video
[16:44:53] (mysterious system cpu usage that doesn't decrease total cpu usage, nice cpu using twice as much as shown (are the ghost ones the mysterious "stolen" cpu?))
[16:44:59] etc
[16:45:11] is that a bug or something?
[16:47:04] andrewbogott:
[16:47:20] I'm looking but I really don't understand the question
[16:47:35] it looks like a normal CPU graph to me… mostly idle, briefly busy while processing a job, back to idle
[16:47:57] http://graphite-labs.wikimedia.org/render/?title=encoding02+CPU+last+week&width=800&height=250&from=-1week&hideLegend=false&uniqueLegend=true&target=alias%28color%28stacked%28video.encoding02.cpu.total.user%29%2C%22%233333bb%22%29%2C%22User%22%29&target=alias%28color%28stacked%28video.encoding02.cpu.total.nice%29%2C%22%23ffea00%22%29%2C%22Nice%22%29&target=a
[16:47:57] lias%28color%28stacked%28video.encoding02.cpu.total.system%29%2C%22%23dd0000%22%29%2C%22System%22%29&target=alias%28color%28stacked%28video.encoding02.cpu.total.iowait%29%2C%22%23ff8a60%22%29%2C%22Wait+I%2FO%22%29&target=alias%28alpha%28color%28stacked%28video.encoding02.cpu.total.idle%29%2C%22%23e2e2f2%22%29%2C0.4%29%2C%22Idle%22%29
[16:48:03] the weekly one
[16:48:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:49:49] zhuyifei1999_: so, I've never looked at a nagf graph before. But it still looks normal to me, I think I don't understand what you're asking :(
[16:50:05] ...
[16:50:44] see the part near 7/17 line and 7/21 line
[16:50:59] notice the idle cpu
[16:51:12] you mean where 'idle' goes about 100% for a bit?
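One way to separate a nagf rendering bug from bad stored data is to ask graphite directly whether the per-mode CPU series sum to roughly 100%. A sketch against the same metric names as the render URL above:

```bash
# -g disables curl's own {}-globbing so the brace expansion reaches graphite.
# If the sum hovers near 100, the stored data is consistent and the ghost
# readings are a drawing artifact (e.g. stacking order), not bad collection.
curl -sg "http://graphite-labs.wikimedia.org/render/?format=json&from=-1week&target=sumSeries(video.encoding02.cpu.total.{user,nice,system,iowait,idle})"
```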
[16:51:46] um… s/about/above/
[16:52:03] at 7/17 line it's 0% with the rest at about 50%, so 50% missing
[16:53:09] at 7/21 it's about 100% with system cpu at about 7%. so about 107% total
[16:53:29] and it doesn't seem to make sense
[16:53:37] at 7/17 I see 'nice' at about 50% and idle at about 50%
[16:54:56] uh, is it generating different graphs?
[16:55:20] hmm weird, refreshed and it seems fine
[16:55:22] maybe :(
[16:56:00] uh refreshed again and it's back
[16:57:12] https://usercontent.irccloud-cdn.com/file/IK3IS2Ep/Screenshot%20from%202016-07-22%2000%3A56%3A20.png
[16:58:55] My graph doesn't look like that :( But yeah, it's obviously not subtracting System CPU from idle
[16:58:57] (doesn't seem to be reproducible again)
[16:59:01] looks like a misplaced +/- in the code
[16:59:16] that or it's doing something clever and allocating an additional vcpu for system stuff, but that seems less likely
[17:00:51] yeah, the total is staying at 100% now, and refreshes aren't showing that buggy graph again
[17:01:53] sorry if that wasted some of your time
[17:03:45] 06Labs, 10Deployment-Systems, 10wikitech.wikimedia.org: /etc/mediawiki/WikitechPrivateSettings.php not found on tin - https://phabricator.wikimedia.org/T139917#2484108 (10Dereckson)
[17:04:40] * zhuyifei1999_ still wonders where that system cpu usage data comes from
[17:05:46] 06Labs, 10wikitech.wikimedia.org, 07Upstream, 07Wikimedia-log-errors: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#2484147 (10Dereckson)
[17:05:48] 06Labs, 10wikitech.wikimedia.org: Upgrade SMW to 1.9 or later - https://phabricator.wikimedia.org/T62886#2484146 (10Dereckson)
[17:06:46] 06Labs, 10wikitech.wikimedia.org, 07Wikimedia-log-errors: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#1949771 (10Dereckson) >>! @kghbln closed the issue upstream and stated: > Thanks for reporting. This indeed was an issue with SMW 1.8.x and MediaWiki...
[17:09:44] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2484199 (10yuvipanda) p:05Triage>03High tools-worker-1018 hit this just now.
[17:14:54] zhuyifei1999_: no problem, it's weird
[17:15:17] * tom29739 wonders where it comes from too
[17:27:31] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Add a diamond collector for kubernetes usage stats - https://phabricator.wikimedia.org/T140887#2484331 (10yuvipanda) 05Open>03Resolved We also have prometheus.
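On the lingering question of where the system CPU numbers come from: diamond's CPU collector derives them from the kernel counters in /proc/stat, so any ghost "stolen" time should also be visible in the raw source. A sketch of what the collector reads:

```bash
# Aggregate CPU time in jiffies since boot; diamond turns deltas of these
# counters into the user/nice/system/idle/iowait percentages graphite stores.
grep '^cpu ' /proc/stat
# fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice
```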
[17:30:05] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484363 (10yuvipanda)
[17:37:42] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484400 (10yuvipanda)
[17:38:03] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2484403 (10yuvipanda) Split off tracking worker nodes into T141017
[17:38:39] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2484405 (10yuvipanda) I'm upgrading all etcd hosts to newer kernel (4.4)
[17:44:33] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[17:48:15] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[17:53:18] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:54:34] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:58:37] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484507 (10yuvipanda)
[18:03:18] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[18:13:17] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:15:24] 06Labs, 10Tool-Labs: Normalize kernel on all Debian Jessie nodes in tools - https://phabricator.wikimedia.org/T140611#2484619 (10yuvipanda) Done for: 1. tools-k8s-etcd-0[1-3] 2. tools-flannel-etcd-0[1-3] 3. tools-k8s-master-01 4. tools-docker-registry-01
[18:16:11] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Create failover host for docker registry - https://phabricator.wikimedia.org/T141030#2484623 (10yuvipanda)
[18:18:18] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Create failover host for docker registry - https://phabricator.wikimedia.org/T141030#2484636 (10yuvipanda) This is made difficult by the fact that: 1. We don't have a Swift cluster 2. Docker's registry model requires that the name of the host be present in the tag, so...
[19:02:36] 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Install wikitech private settings directly onto wikitech hosts - https://phabricator.wikimedia.org/T124732#1964569 (10Dereckson) Side effect of this solution: T140889
[19:37:04] PROBLEM - High iowait on tools-worker-1018 is CRITICAL: CRITICAL: tools.tools-worker-1018.cpu.total.iowait (>100.00%)
[19:40:21] chasemp ^ seems to work
[19:40:50] that's kind of a misnomer since it can't actually check it? or is that node submitting iowait numbers?
[19:40:57] that's the dead node right?
[19:43:12] chasemp it is submitting iowait numbers
[19:43:28] well that's super interesting
[19:43:49] chasemp yeah. when tools-worker-1004 died for example, you can see that it was sending no metrics
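A quick way to settle whether a hung node is still submitting metrics is to pull the most recent datapoints straight from graphite; a run of nulls means collection died with the node. A sketch using the same metric as the iowait alert above:

```bash
# Recent non-null values => diamond on the node is alive despite the hang;
# a trail of nulls => metric collection stopped when the node did.
curl -s 'https://graphite-labs.wikimedia.org/render/?format=json&from=-1h&target=tools.tools-worker-1018.cpu.total.iowait'
```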
[19:43:51] not the case here
[19:44:30] it died off for iostat tho
[19:44:30] https://graphite-labs.wikimedia.org/render/?width=1051&height=474&_salt=1469130163.848&target=tools.tools-worker-1018.iostat.dm-0.await&hideLengend=false&from=-48h
[19:45:12] https://graphite-labs.wikimedia.org/render/?width=1051&height=474&_salt=1469130163.848&target=tools.tools-worker-1018.iostat.dm-0.await&hideLengend=false&from=-48h&lineMode=connected
[19:45:24] do we know when it went offline YuviPanda^
[19:45:30] I wonder if it aligns w/ iostat death
[19:45:31] nope
[19:45:34] iostat collection even
[19:45:38] I usually look at kern.log after I reboot
[19:45:44] dollars to donuts that lines up
[19:46:06] woah I actually got a shell on it
[19:46:08] lost it again
[19:46:25] hm
[19:46:31] (i had it opened a long time ago and it had actually gone through)
[19:46:42] The last Puppet run was at Thu Jul 21 10:43:51 UTC 2016 (542 minutes ago).
[19:46:47] I get to there
[19:46:52] chasemp I have a shell
[19:46:55] what's even more odd is
[19:47:01] load isn't increasing
[19:47:02] if you go in as root and when it 'hangs' if you ctrl-c you get a shell
[19:47:04] if it really has things in d-wait
[19:47:08] hm
[19:47:14] I never get that far when I've tried
[19:47:16] thus far
[19:47:18] and then I run 'top' and it hangs
[19:47:29] ssh root@tools-worker-1018.eqiad.wmflabs
[19:47:29] ?
[19:47:39] yeah
[19:47:47] and basically hit enter a few times and then ctrl-c a few times
[19:48:06] no dice for me so far
[19:48:08] but
[19:48:24] yeah, a ps auxf hung my kernel
[19:48:25] err, shell
[19:48:26] iotop?
[19:49:13] pidstat -l -d
[19:49:14] hm
[19:49:14] trying to get a new shell now
[19:49:29] can you fork it and try to dump to a file?
[19:49:37] god we desperately need console access
[19:50:07] pidstat -l -d hung
[19:50:17] chasemp what do you mean by 'fork it and try to dump to a file'
[19:50:49] looks like I can run bash builtins and nothing else
[19:50:56] cat works
[19:51:03] (I have to go soon though)
[19:51:07] k
[19:51:42] chasemp I got a stack trace in dmesg!
[19:51:48] nice
[19:52:00] paste it?
[19:52:23] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484932 (10yuvipanda) Found this stack trace in dmesg for tools-worker-1018: ``` [Thu Jul 21 11:10:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds....
[19:52:23] chasemp https://phabricator.wikimedia.org/T141017
[19:53:19] chasemp I saved dmesg into /srv/dmesglogs as well
[19:53:25] k
[19:53:35] I don't see anything illuminating there yet
[19:54:13] [Thu Jul 21 11:06:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds.
[19:54:14] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484933 (10yuvipanda) First report of the error is at ``` [Thu Jul 21 11:06:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds. ```
[19:54:14] first report
[19:54:39] what size are these nodes?
[19:54:42] xlarge?
[19:55:46] chasemp large
[19:55:56] chasemp etcd ones are small
[20:00:03] chasemp I munged around in /proc and found a list of running processes. these include ones that should've been killed when I depooled it
[20:00:11] Hello all :)
[20:00:16] * d3r1ck checks in
[20:01:19] YuviPanda: pidstat -l -d
[20:01:23] does that run?
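Since only bash builtins still worked on the wedged node, process state can still be read from /proc without forking any new binaries; relevant to the pidstat question just above, whose answer follows below. A sketch of a builtins-only hunt for processes stuck in uninterruptible sleep:

```bash
# With the filesystem wedged, fork/exec of new binaries hangs, but bash
# builtins and /proc reads often still work. List tasks in state D
# (uninterruptible sleep) using builtins only:
for d in /proc/[0-9]*; do
  read -r pid comm state _ < "$d/stat" || continue  # skip vanished processes; misparses if comm contains spaces
  [ "$state" = D ] && echo "$pid $comm"
done 2>/dev/null
```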
[20:01:31] nope
[20:01:50] well that sucks
[20:01:50] all of my shells also hang after a few minutes
[20:01:52] and I open a new one
[20:01:57] I got xargs to work
[20:02:07] NOHUP pidstat -l -d &
[20:02:36] -bash: NOHUP: command not found
[20:02:40] ?
[20:02:55] is it case sensitive :)
[20:02:55] bah all small
[20:02:58] sheepishly I suggest
[20:02:59] but now my terminal is gone again
[20:03:13] nohup pidstat -l -d > /tmp/foo &
[20:03:14] idk
[20:03:15] worht a shot
[20:03:19] worth a shot
[20:03:50] there's a foo in /srv
[20:03:51] but it's empty
[20:03:59] and I gotta go...
[20:04:32] k
[20:04:34] good travels
[20:05:04] ty
[20:34:13] 10Labs-Kubernetes: Odd kubernetes error - https://phabricator.wikimedia.org/T141041#2485077 (10Magnus)
[21:48:11] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484363 (10hashar) jbd2_journal_commit_transaction and overall diskio borked somehow. We had that for a while on CI slaves end of June. A console trace is P3278 and tas...
[22:42:27] !log tools reboot tools-worker-1018 as stuck T141017
[22:42:28] T141017: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017
[22:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:48:51] PROBLEM - Puppet staleness on tools-worker-1018 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[22:48:57] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2485710 (10chasemp)
[22:57:01] RECOVERY - High iowait on tools-worker-1018 is OK: OK: All targets OK
[22:58:51] RECOVERY - Puppet staleness on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [3600.0]
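A footnote on the NOHUP attempt above: nohup is lowercase, and without redirecting stderr as well the output can vanish with the dying shell; the file also stays empty if pidstat itself blocks on I/O before printing its first sample, which is likely what happened here. A sketch of the intended invocation:

```bash
# Sample per-process disk I/O once a second (-d), with full command lines (-l),
# appending to a file that survives the SSH session; 2>&1 captures errors too.
nohup pidstat -l -d 1 >> /tmp/pidstat.out 2>&1 &
```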