[01:08:18] 10Tool-Labs-tools-Pageviews: Do not show log scale when all values are below 10 - https://phabricator.wikimedia.org/T140910#2482656 (10MusikAnimal) 05Open>03Resolved a:03MusikAnimal Fixed with https://github.com/MusikAnimal/pageviews/commit/d6221bdb37a1168f2344f8ada75657fd7f345417
[01:08:49] 10Tool-Labs-tools-Pageviews: Make "stepSize" of Y-axis no smaller than one, so only show integers - https://phabricator.wikimedia.org/T140784#2482659 (10MusikAnimal) 05Open>03Resolved a:03MusikAnimal Fixed with https://github.com/MusikAnimal/pageviews/commit/d6221bdb37a1168f2344f8ada75657fd7f345417
[01:10:36] 10Tool-Labs-tools-Pageviews: Add URL parameters to show/hide log scale and start Y-axis from zero - https://phabricator.wikimedia.org/T140783#2482662 (10MusikAnimal) 05Open>03Resolved The option actually just disables "auto-log detection", so the option to show the log scale is there, it just won't be shown...
[04:02:06] 06Labs, 10Labs-Infrastructure, 06Discovery, 06Maps, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2459518 (10Dzahn) just some technical notes: osmdb.eqiad.wmnet is an alias for labsdb1006.eqiad.wmnet cheat sheet for shp2pgsql htt...
[04:32:04] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Gammawave was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=771760 edit summary:
[04:35:32] 10Labs-project-wikistats: wikistats: add tcy.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2482793 (10Dzahn)
[04:36:18] 10Labs-project-wikistats: wikistats: add tcy.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2482806 (10Dzahn)
[05:41:04] how to connect tools-login.wmflabs.org via ftp (cannot connect to server)
[05:41:39] i used pagent
[05:41:55] pageant
[05:46:01] ???
[05:53:06] ???
[06:33:42] 06Labs, 10Tool-Labs, 06Operations, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2441209 (10greg) rOPUP:modules/toollabs/manifests/dev_environ.pp already has differences for what is installed and not just version, but software themselves...
[08:40:24] 06Labs, 10Beta-Cluster-Infrastructure, 07Blocked-on-Operations, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2482979 (10hashar) I haven't seen that occurr...
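A note on the unanswered FTP question above: the Tool Labs bastions speak SSH only, so a plain-FTP client cannot connect; file transfer goes over SFTP or SCP using the same SSH key that Pageant holds. A minimal sketch, assuming the key is already loaded in Pageant or ssh-agent and using "yourshellname" as a placeholder account name:

```bash
# SFTP rides on SSH (port 22); there is no separate FTP service on the bastion.
sftp yourshellname@tools-login.wmflabs.org

# Or copy a single file non-interactively with scp:
scp mytool.py yourshellname@tools-login.wmflabs.org:~/
```

On Windows, a GUI client such as WinSCP or FileZilla works the same way, provided it is set to SFTP mode rather than FTP.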
[08:48:11] 06Labs, 10Labs-Infrastructure, 10Shinken: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483016 (10hashar)
[08:48:21] 06Labs, 10Labs-Infrastructure, 10Shinken: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483030 (10hashar) p:05Triage>03High
[08:48:59] 06Labs, 07Graphite, 13Patch-For-Review: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#1850400 (10hashar) Yesterday spring most probably has broken the labs Shinken that got a 401 trying to reach labmon1001: T140976
[08:59:05] (03PS1) 10Lokal Profil: [Don't merge] Add Albania and Kosovo to the database [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/300233 (https://phabricator.wikimedia.org/T140488)
[09:04:19] 06Labs, 10Labs-Infrastructure, 06Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2483051 (10fgiunchedi) serpens still shows some memory growth, possibly not fixed yet {F4293977}
[10:45:18] (03CR) 10Lokal Profil: [C: 031] "Looks fine." [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299891 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[11:10:54] (03CR) 10Jean-Frédéric: [C: 032] "No, I was just assuming you would have been added automatically. :) Thanks for the review!" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299891 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[11:13:02] (03Merged) 10jenkins-bot: Harvest Wikidata item for Canada in English ca_(en) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299891 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[11:59:06] 06Labs, 10Beta-Cluster-Infrastructure, 07Blocked-on-Operations, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10AlexMonk-WMF) I saw it just a few...
[12:55:59] 06Labs, 10Tool-Labs, 10MediaWiki-Interwiki, 10MediaWiki-extensions-Interwiki, 10Wikimedia-Interwiki-links: Toollabs interwiki doesn't work when url parameters needed - https://phabricator.wikimedia.org/T140981#2483286 (10Dvorapa)
[13:22:48] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483339 (10yuvipanda) a:03yuvipanda
[14:16:43] wikimedia/nagf#49 (master - 3bcf213: Yuvi Panda) The build passed. - https://travis-ci.org/wikimedia/nagf/builds/146388742
[14:17:27] Krinkle ^ jfyi (I merged it directly) to update graphite URL (old one continues to redirect appropriately)
[14:31:58] 06Labs, 10Labs-Infrastructure: Recapture unused floating IPs - https://phabricator.wikimedia.org/T140985#2483421 (10Andrew)
[14:32:19] 06Labs, 10Labs-Infrastructure: Recapture unused floating IPs - https://phabricator.wikimedia.org/T140985#2483421 (10yuvipanda) The tools project uses 50 of these, need to revisit and kill.
[14:33:57] 06Labs, 06Operations, 13Patch-For-Review: Move labs graphite to graphite-labs.wikimedia.org - https://phabricator.wikimedia.org/T140899#2483436 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Done and left redirects in place!
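On the graphite move just resolved above ("left redirects in place"): one quick way to confirm a legacy hostname still redirects is to inspect the response headers. A sketch, assuming the previous name was graphite.wmflabs.org (the old URL is not spelled out in the log):

```bash
# Expect a 301/302 status and a Location header pointing at the new host.
curl -sI https://graphite.wmflabs.org/ | grep -iE '^(HTTP|Location)'
```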
[14:35:55] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: labmon1001 renderer requires authentication breaking labs Shinken probes - https://phabricator.wikimedia.org/T140976#2483439 (10yuvipanda) 05Open>03Resolved Thanks for reporting :D is sorted out now.
[14:36:37] 06Labs, 07Graphite, 13Patch-For-Review: Setup "official labs grafana" instance - https://phabricator.wikimedia.org/T120295#2483441 (10yuvipanda) Is setup to allow login to https://grafana-labs-admin.wikimedia.org to anyone who is a member of any labs project. Still can't see the dashboards I save there in ht...
[14:44:00] 06Labs, 10Tool-Labs: Webservice on Tools Labs fails repeatedly - https://phabricator.wikimedia.org/T115231#2483454 (10russblau) It is currently down again. Shell shows the following: ``` tools.dplbot@tools-bastion-03:~$ kubectl get pod NAME READY STATUS RESTARTS AGE dplbot-14457...
[14:45:09] 06Labs, 10Tool-Labs: Webservice on Tools Labs fails repeatedly - https://phabricator.wikimedia.org/T115231#2483468 (10yuvipanda) Don't, am looking at it just now.
[14:47:06] (03CR) 10Lokal Profil: [C: 032] Add two known fields to fr (fr) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299886 (owner: 10Jean-Frédéric)
[14:47:52] (03Merged) 10jenkins-bot: Add two known fields to fr (fr) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299886 (owner: 10Jean-Frédéric)
[14:48:33] 06Labs, 10Tool-Labs: Webservice on Tools Labs fails repeatedly - https://phabricator.wikimedia.org/T115231#2483472 (10yuvipanda) Hmm, I fixed it (required a restart of kube2proxy layer). I'll file a separate bug to investigate this just now.
[14:50:55] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Monitor kube2proxy failures - https://phabricator.wikimedia.org/T140988#2483474 (10yuvipanda)
[15:06:03] (03CR) 10Lokal Profil: [C: 032] Harvest Wikidata item in gb-eng (en) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299890 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[15:06:55] (03Merged) 10jenkins-bot: Harvest Wikidata item in gb-eng (en) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/299890 (https://phabricator.wikimedia.org/T140795) (owner: 10Jean-Frédéric)
[15:23:51] 10Wikibugs, 13Patch-For-Review: Do not notify #Trash tasks to IRC - https://phabricator.wikimedia.org/T140426#2464291 (10Luke081515) Alternatively, we can [[https://phabricator.wikimedia.org/project/manage/89/#24407 change the name back]], before the rename it worked.
[15:27:10] 10Wikibugs, 13Patch-For-Review: Do not notify #Trash tasks to IRC - https://phabricator.wikimedia.org/T140426#2464291 (10Dzahn) We just gotta watch out we are not creating more meta notifications about not creating notifications than by just ..not touching all the trash tasks.
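Back on the dplbot webservice outage earlier in this hour: the actual fix was an admin-side kube2proxy restart, but the first checks a tool maintainer runs look roughly like the sketch below. It uses the standard Tool Labs helpers (`become`, `webservice`); the pod name is illustrative, matching the truncated `dplbot-14457...` in the task comment:

```bash
become dplbot                      # switch from your own account to the tool account
kubectl get pod                    # list this tool's Kubernetes pods; STATUS should be Running
kubectl logs dplbot-14457-xxxxx    # illustrative pod name; inspect recent output for errors
webservice restart                 # bounce the webservice if the pod itself is wedged
```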
[15:51:08] 06Labs, 10Horizon, 13Patch-For-Review: Investigate (and probably disable) 'rebuild instance' option - https://phabricator.wikimedia.org/T140259#2483686 (10Andrew) 05Open>03Resolved
[16:12:53] 06Labs, 10Tool-Labs: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2483810 (10yuvipanda)
[16:13:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[16:14:20] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2483827 (10bd808) Gerrit repos for `labs/striker` and `labs/striker/wheels` [[https://www.mediawiki.org/w/index.php?title=Git%2FNew_repositories%2FRequests%2FEntri...
[16:19:58] YuviPanda: okay, thanks
[16:40:10] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Striker: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2483939 (10bd808)
[16:40:23] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, and 2 others: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2483940 (10bd808)
[16:40:34] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Security-Reviews, 10Striker: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2483942 (10bd808)
[16:41:12] andrewbogott: there?
[16:41:19] what's up?
[16:42:05] 06Labs, 10Labs-Infrastructure, 10DBA, 10Striker: Investigate moving labsdb (replicas) user credential management to 'Striker' (codename) - https://phabricator.wikimedia.org/T140832#2483968 (10bd808)
[16:42:07] zhuyifei1999_: ?
[16:43:01] I haven't monitored the tool closely this week, but the nagf metrics for encoding02 look weird
[16:43:12] http://tools.wmflabs.org/nagf/?project=video
[16:44:53] (mysterious system cpu usage that doesn't decrease total cpu usage, nice cpu using twice as much as shown (are the ghost ones the mysterious "stolen" cpu?))
[16:44:59] etc
[16:45:11] is that a bug or something?
[16:47:04] andrewbogott:
[16:47:20] I'm looking but I really don't understand the question
[16:47:35] it looks like a normal CPU graph to me… mostly idle, briefly busy while processing a job, back to idle
[16:47:57] http://graphite-labs.wikimedia.org/render/?title=encoding02+CPU+last+week&width=800&height=250&from=-1week&hideLegend=false&uniqueLegend=true&target=alias%28color%28stacked%28video.encoding02.cpu.total.user%29%2C%22%233333bb%22%29%2C%22User%22%29&target=alias%28color%28stacked%28video.encoding02.cpu.total.nice%29%2C%22%23ffea00%22%29%2C%22Nice%22%29&target=a
[16:47:57] lias%28color%28stacked%28video.encoding02.cpu.total.system%29%2C%22%23dd0000%22%29%2C%22System%22%29&target=alias%28color%28stacked%28video.encoding02.cpu.total.iowait%29%2C%22%23ff8a60%22%29%2C%22Wait+I%2FO%22%29&target=alias%28alpha%28color%28stacked%28video.encoding02.cpu.total.idle%29%2C%22%23e2e2f2%22%29%2C0.4%29%2C%22Idle%22%29
[16:48:03] the weekly one
[16:48:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:49:49] zhuyifei1999_: so, I've never looked at a nagf graph before. But it still looks normal to me, I think I don't understand what you're asking :(
[16:50:05] ...
[16:50:44] see the part near 7/17 line and 7/21 line
[16:50:59] notice the idle cpu
[16:51:12] you mean where 'idle' goes about 100% for a bit?
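One way to separate a nagf rendering bug from bad stored data is to ask graphite directly whether the per-mode CPU series sum to roughly 100%. A sketch against the same metric names as the render URL above:

```bash
# -g disables curl's own {}-globbing so the brace expansion reaches graphite.
# If the sum hovers near 100, the stored data is consistent and the ghost
# readings are a drawing artifact (e.g. stacking order), not bad collection.
curl -sg "http://graphite-labs.wikimedia.org/render/?format=json&from=-1week&target=sumSeries(video.encoding02.cpu.total.{user,nice,system,iowait,idle})"
```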
[16:51:46] um… s/about/above/
[16:52:03] at 7/17 line it's 0% with the rest at about 50%, so 50% missing
[16:53:09] at 7/21 it's about 100% with system cpu at about 7%. so about 107% total
[16:53:29] and it doesn't seem to make sense
[16:53:37] at 7/17 I see 'nice' at about 50% and idle at about 50%
[16:54:56] uh, is it generating different graphs?
[16:55:20] hmm weird, refreshed and it seems fine
[16:55:22] maybe :(
[16:56:00] uh refreshed again and it's back
[16:57:12] https://usercontent.irccloud-cdn.com/file/IK3IS2Ep/Screenshot%20from%202016-07-22%2000%3A56%3A20.png
[16:58:55] My graph doesn't look like that :( But yeah, it's obviously not subtracting System CPU from idle
[16:58:57] (doesn't seem to be reproducible again)
[16:59:01] looks like a misplaced +/- in the code
[16:59:16] that or it's doing something clever and allocating an additional vcpu for system stuff, but that seems less likely
[17:00:51] yeah, the total is staying at 100% now, and refreshes aren't showing that buggy graph again
[17:01:53] sorry if that wasted some of your time
[17:03:45] 06Labs, 10Deployment-Systems, 10wikitech.wikimedia.org: /etc/mediawiki/WikitechPrivateSettings.php not found on tin - https://phabricator.wikimedia.org/T139917#2484108 (10Dereckson)
[17:04:40] * zhuyifei1999_ still wonders where that system cpu usage data comes from
[17:05:46] 06Labs, 10wikitech.wikimedia.org, 07Upstream, 07Wikimedia-log-errors: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#2484147 (10Dereckson)
[17:05:48] 06Labs, 10wikitech.wikimedia.org: Upgrade SMW to 1.9 or later - https://phabricator.wikimedia.org/T62886#2484146 (10Dereckson)
[17:06:46] 06Labs, 10wikitech.wikimedia.org, 07Wikimedia-log-errors: PHP array to string conversion on wikitech in SMW 1.8.x - https://phabricator.wikimedia.org/T124235#1949771 (10Dereckson) >>! @kghbln closed the issue upstream and stated: > Thanks for reporting. This indeed was an issue with SMW 1.8.x and MediaWiki...
[17:09:44] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2484199 (10yuvipanda) p:05Triage>03High tools-worker-1018 hit this just now.
[17:14:54] zhuyifei1999_: no problem, it's weird
[17:15:17] * tom29739 wonders where it comes from too
[17:27:31] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Add a diamond collector for kubernetes usage stats - https://phabricator.wikimedia.org/T140887#2484331 (10yuvipanda) 05Open>03Resolved We also have prometheus.
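On the lingering question of where the system CPU numbers come from: diamond's CPU collector derives them from the kernel counters in /proc/stat, so any ghost "stolen" time should also be visible in the raw source. A sketch of what the collector reads:

```bash
# Aggregate CPU time in jiffies since boot; diamond turns deltas of these
# counters into the user/nice/system/idle/iowait percentages graphite stores.
grep '^cpu ' /proc/stat
# fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice
```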
[17:30:05] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484363 (10yuvipanda)
[17:37:42] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484400 (10yuvipanda)
[17:38:03] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2484403 (10yuvipanda) Split off tracking worker nodes into T141017
[17:38:39] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: etcd hosts hanging with kernel hang - https://phabricator.wikimedia.org/T140256#2484405 (10yuvipanda) I'm upgrading all etcd hosts to newer kernel (4.4)
[17:44:33] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[17:48:15] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[17:53:18] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:54:34] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:58:37] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484507 (10yuvipanda)
[18:03:18] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[18:13:17] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:15:24] 06Labs, 10Tool-Labs: Normalize kernel on all Debian Jessie nodes in tools - https://phabricator.wikimedia.org/T140611#2484619 (10yuvipanda) Done for: 1. tools-k8s-etcd-0[1-3] 2. tools-flannel-etcd-0[1-3] 3. tools-k8s-master-01 4. tools-docker-registry-01
[18:16:11] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Create failover host for docker registry - https://phabricator.wikimedia.org/T141030#2484623 (10yuvipanda)
[18:18:18] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Create failover host for docker registry - https://phabricator.wikimedia.org/T141030#2484636 (10yuvipanda) This is made difficult by the fact that: 1. We don't have a Swift cluster 2. Docker's registry model requires that the name of the host be present in the tag, so...
[19:02:36] 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Install wikitech private settings directly onto wikitech hosts - https://phabricator.wikimedia.org/T124732#1964569 (10Dereckson) Side effect of this solution: T140889
[19:37:04] PROBLEM - High iowait on tools-worker-1018 is CRITICAL: CRITICAL: tools.tools-worker-1018.cpu.total.iowait (>100.00%)
[19:40:21] chasemp ^ seems to work
[19:40:50] that's kind of a misnomer since it can't actually check it? or is that node submitting iowait numbers?
[19:40:57] that's the dead node right?
[19:43:12] chasemp it is submitting iowait numbers
[19:43:28] well that's super interesting
[19:43:49] chasemp yeah. when tools-worker-1004 died for example, you can see that it was sending no metrics
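A quick way to settle whether a hung node is still submitting metrics is to pull the most recent datapoints straight from graphite; a run of nulls means collection died with the node. A sketch using the same metric as the iowait alert above:

```bash
# Recent non-null values => diamond on the node is alive despite the hang;
# a trail of nulls => metric collection stopped when the node did.
curl -s 'https://graphite-labs.wikimedia.org/render/?format=json&from=-1h&target=tools.tools-worker-1018.cpu.total.iowait'
```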
[19:43:51] not the case here
[19:44:30] it died off for iostat tho
[19:44:30] https://graphite-labs.wikimedia.org/render/?width=1051&height=474&_salt=1469130163.848&target=tools.tools-worker-1018.iostat.dm-0.await&hideLengend=false&from=-48h
[19:45:12] https://graphite-labs.wikimedia.org/render/?width=1051&height=474&_salt=1469130163.848&target=tools.tools-worker-1018.iostat.dm-0.await&hideLengend=false&from=-48h&lineMode=connected
[19:45:24] do we know when it went offline YuviPanda^
[19:45:30] I wonder if it aligns w/ iostat death
[19:45:31] nope
[19:45:34] iostat collection even
[19:45:38] I usually look at kern.log after I reboot
[19:45:44] dollars to donuts that lines up
[19:46:06] woah I actually got a shell on it
[19:46:08] lost it again
[19:46:25] hm
[19:46:31] (i had it opened a long time ago and it had actually gone through)
[19:46:42] The last Puppet run was at Thu Jul 21 10:43:51 UTC 2016 (542 minutes ago).
[19:46:47] I get to there
[19:46:52] chasemp I have a shell
[19:46:55] what's even more odd is
[19:47:01] load isn't increasing
[19:47:02] if you go in as root and when it 'hangs' if you ctrl-c you get a shell
[19:47:04] if it really has things in d-wait
[19:47:08] hm
[19:47:14] I never get that far when I've tried
[19:47:16] thus far
[19:47:18] and then I run 'top' and it hangs
[19:47:29] ssh root@tools-worker-1018.eqiad.wmflabs
[19:47:29] ?
[19:47:39] yeah
[19:47:47] and basically hit enter a few times and then ctrl-c a few times
[19:48:06] no dice for me so far
[19:48:08] but
[19:48:24] yeah, a ps auxf hung my kernel
[19:48:25] err, shell
[19:48:26] iotop?
[19:49:13] pidstat -l -d
[19:49:14] hm
[19:49:14] trying to get a new shell now
[19:49:29] can you fork it and try to dump to a file?
[19:49:37] god we desperately need console access
[19:50:07] pidstat -l -d hung
[19:50:17] chasemp what do you mean by 'fork it and try to dump to a file'
[19:50:49] looks like I can run bash builtins and nothing else
[19:50:56] cat works
[19:51:03] (I have to go soon though)
[19:51:07] k
[19:51:42] chasemp I got a stack trace in dmesg!
[19:51:48] nice
[19:52:00] paste it?
[19:52:23] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484932 (10yuvipanda) Found this stack trace in dmesg for tools-worker-1018: ``` [Thu Jul 21 11:10:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds....
[19:52:23] chasemp https://phabricator.wikimedia.org/T141017
[19:53:19] chasemp I saved dmesg into /srv/dmesglogs as well
[19:53:25] k
[19:53:35] I don't see anything illuminating there yet
[19:54:13] [Thu Jul 21 11:06:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds.
[19:54:14] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484933 (10yuvipanda) First report of the error is at ``` [Thu Jul 21 11:06:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds. ```
[19:54:14] first report
[19:54:39] what size are these nodes?
[19:54:42] xlarge?
[19:55:46] chasemp large
[19:55:56] chasemp etcd ones are small
[20:00:03] chasemp I munged around in /proc and found a list of running processes. these include ones that should've been killed when I depooled it
[20:00:11] Hello all :)
[20:00:16] * d3r1ck checks in
[20:01:19] YuviPanda: pidstat -l -d
[20:01:23] does that run?
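Since only bash builtins still worked on the wedged node, process state can still be read from /proc without forking any new binaries; relevant to the pidstat question just above, whose answer follows below. A sketch of a builtins-only hunt for processes stuck in uninterruptible sleep:

```bash
# With the filesystem wedged, fork/exec of new binaries hangs, but bash
# builtins and /proc reads often still work. List tasks in state D
# (uninterruptible sleep) using builtins only:
for d in /proc/[0-9]*; do
  read -r pid comm state _ < "$d/stat" || continue  # skip vanished processes; misparses if comm contains spaces
  [ "$state" = D ] && echo "$pid $comm"
done 2>/dev/null
```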
[20:01:31] nope
[20:01:50] well that sucks
[20:01:50] all of my shells also hang after a few minutes
[20:01:52] and I open a new one
[20:01:57] I got xargs to work
[20:02:07] NOHUP pidstat -l -d &
[20:02:36] -bash: NOHUP: command not found
[20:02:40] ?
[20:02:55] is it case sensitive :)
[20:02:55] bah all small
[20:02:58] sheepishly I suggest
[20:02:59] but now my terminal is gone again
[20:03:13] nohup pidstat -l -d > /tmp/foo &
[20:03:14] idk
[20:03:15] worht a shot
[20:03:19] worth a shot
[20:03:50] there's a foo in /srv
[20:03:51] but it's empty
[20:03:59] and I gotta go...
[20:04:32] k
[20:04:34] good travels
[20:05:04] ty
[20:34:13] 10Labs-Kubernetes: Odd kubernetes error - https://phabricator.wikimedia.org/T141041#2485077 (10Magnus)
[21:48:11] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2484363 (10hashar) jbd2_journal_commit_transaction and overall diskio borked somehow. We had that for a while on CI slaves end of June. A console trace is P3278 and tas...
[22:42:27] !log tools reboot tools-worker-1018 as stuck T141017
[22:42:28] T141017: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017
[22:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:48:51] PROBLEM - Puppet staleness on tools-worker-1018 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[22:48:57] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Kubernetes worker nodes hanging - https://phabricator.wikimedia.org/T141017#2485710 (10chasemp)
[22:57:01] RECOVERY - High iowait on tools-worker-1018 is OK: OK: All targets OK
[22:58:51] RECOVERY - Puppet staleness on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [3600.0]
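A footnote on the NOHUP attempt above: nohup is lowercase, and without redirecting stderr as well the output can vanish with the dying shell; the file also stays empty if pidstat itself blocks on I/O before printing its first sample, which is likely what happened here. A sketch of the intended invocation:

```bash
# Sample per-process disk I/O once a second (-d), with full command lines (-l),
# appending to a file that survives the SSH session; 2>&1 captures errors too.
nohup pidstat -l -d 1 >> /tmp/pidstat.out 2>&1 &
```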