[05:25:27] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[07:04:39] PROBLEM - Puppet run on tools-docker-builder-03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[09:40:26] Striker, Phabricator, Security-Reviews, Patch-For-Review: Unable to mirror repository from git.legoktm.com into diffusion - https://phabricator.wikimedia.org/T143969#2647761 (faidon) I don't have any strong feelings towards either direction, no. (let's see if Moritz or Darian feel otherwise) As...
[11:14:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[11:54:51] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:41:55] PROBLEM - SSH on tools-webgrid-lighttpd-1210 is CRITICAL: Server answer
[13:26:36] (PS1) Tobias Gritschacher: Add 2ColConflict and ElectronPdfService extensions [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/311414
[13:31:30] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2648137 (chasemp)
[13:31:32] Labs, User-Nikerabbit: Request creation of wmwcourse labs project - https://phabricator.wikimedia.org/T144388#2648135 (chasemp) Open>stalled >>! In T144388#2629978, @Nikerabbit wrote: > Yep. I said early 2017 in case the project work takes more time to finish after the lectures end in December....
[13:31:48] Labs, User-Nikerabbit: Revert in 01/2017: Request creation of wmwcourse labs project - https://phabricator.wikimedia.org/T144388#2648138 (chasemp)
[13:34:46] Labs, Goal: Create labtest cluster - https://phabricator.wikimedia.org/T120293#2648147 (chasemp)
[13:34:48] Labs: Install and configure labtestnet2001 as a labnet gateway - https://phabricator.wikimedia.org/T120297#2648145 (chasemp) Open>Resolved a:chasemp
[13:48:52] Labs, Labs-Infrastructure: New instance first puppet run is broken - https://phabricator.wikimedia.org/T144330#2648217 (chasemp) Open>Resolved a:chasemp Yes, I believe so.
[14:03:14] hey yuvipanda! could you add me to the lolrrit-wm tool please? :D
[14:05:43] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[14:06:00] (CR) Addshore: [C: 1] "Looks like I can't deploy this after I merge yet so just +1 for now." [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/311414 (owner: Tobias Gritschacher)
[14:13:20] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[14:27:01] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[14:51:59] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[14:54:09] (CR) Paladox: "@Addshore hi, if you can merge I can deploy for you?" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/311414 (owner: Tobias Gritschacher)
[14:54:23] (CR) Addshore: [C: 2] Add 2ColConflict and ElectronPdfService extensions [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/311414 (owner: Tobias Gritschacher)
[14:55:01] (Merged) jenkins-bot: Add 2ColConflict and ElectronPdfService extensions [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/311414 (owner: Tobias Gritschacher)
[14:56:17] (CR) Paladox: "@Addshore hi, would you also be able to merge this one please? This is all tested and has been deployed." [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[14:56:41] Labs: Request creation of Mathematical Refresh Rate Policies labs project - https://phabricator.wikimedia.org/T143901#2582592 (Andrew) Hello! I'm sorry that this request hasn't been acknowledged. If you would still like the project created, can you tell us more about what will run inside the project (and,...
[14:57:26] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[14:58:07] (PS7) Paladox: Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082)
[14:58:43] (CR) Addshore: Do not show merges by the L10n-bot (1 comment) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[14:59:35] (CR) Paladox: Do not show merges by the L10n-bot (1 comment) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[14:59:50] Labs, Labs-Infrastructure: Experiment with Linux KSM (dedupe memory shared by instances) on labs infra - https://phabricator.wikimedia.org/T146037#2648431 (hashar)
[15:00:54] (PS8) Paladox: Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082)
[15:01:17] (CR) jenkins-bot: [V: -1] Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:01:53] (PS9) Paladox: Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082)
[15:02:53] (CR) Paladox: "@Addshore done" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:04:29] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[15:04:35] (CR) Addshore: Do not show merges by the L10n-bot (1 comment) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:05:43] (PS10) Paladox: Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082)
[15:05:57] (CR) Paladox: "@Addshore done :)" (1 comment) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:06:25] (CR) Addshore: [C: 2] Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:06:56] (Merged) jenkins-bot: Do not show merges by the L10n-bot [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:07:21] !log tools.lolrrit-wm deploying https://gerrit.wikimedia.org/r/311414 and https://gerrit.wikimedia.org/r/#/c/308949/
[15:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[15:14:11] (CR) Paladox: "Thanks." [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/308949 (https://phabricator.wikimedia.org/T93082) (owner: Paladox)
[15:14:23] addshore ^^ deployed :)
[15:14:29] thanks!
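The change merged and deployed above teaches grrrit-wm to stop relaying merge notifications made on behalf of L10n-bot. grrrit-wm itself is a Node.js service, so the real patch lives in labs/tools/grrrit; the Python sketch below only illustrates the filtering idea, and the event field names (Gerrit stream-events `type` and `patchSet.uploader.username`) and the exact bot account name are assumptions rather than a copy of the deployed code.

```python
# Illustrative sketch only -- grrrit-wm is a Node.js bot, so this is not its
# actual source. It shows the idea behind "Do not show merges by the L10n-bot":
# drop Gerrit stream events for merges made on behalf of the translation bot.

IGNORED_UPLOADERS = {"l10n-bot"}  # assumed account name; adjust as needed


def should_relay(event):
    """Return True if a Gerrit stream event should be relayed to IRC."""
    if event.get("type") != "change-merged":
        return True  # only merge events are filtered here
    patch_set = event.get("patchSet") or {}
    uploader = (patch_set.get("uploader") or {}).get("username", "")
    return uploader.lower() not in IGNORED_UPLOADERS


# Example: a merge uploaded by L10n-bot is suppressed, anything else passes.
assert should_relay({"type": "comment-added"})
assert not should_relay(
    {"type": "change-merged", "patchSet": {"uploader": {"username": "L10n-bot"}}}
)
```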
[15:14:35] You're welcome :)
[15:16:35] Labs: Request creation of Mathematical Refresh Rate Policies labs project - https://phabricator.wikimedia.org/T143901#2648499 (chasemp) p:Triage>Normal @Agaherbert can you expand a bit on what you need here? We can allocate resources but from what I read and understand this seems like the kind of an...
[15:19:51] Hello everyone, db1069 was hit again by this: https://phabricator.wikimedia.org/T145077 - I have collected all the information again and will feed it back to Percona and TokuDB so they can investigate further
[15:20:02] db1069 is now fixed and trying to catch up
[15:21:12] https://tools.wmflabs.org/replag/ s2 lag there is due
[15:21:19] *there is due to this issue
[15:53:50] Tool-Labs-tools-Pageviews: Add "wiki page" as a source to Massviews - https://phabricator.wikimedia.org/T144251#2648701 (MusikAnimal) Open>Resolved a:MusikAnimal Done with https://github.com/MusikAnimal/pageviews/releases/tag/2016.09.19T15.48
[15:54:09] Tool-Labs-tools-Pageviews: Add "subpages" as a source to Massviews - https://phabricator.wikimedia.org/T144238#2648709 (MusikAnimal) Open>Resolved a:MusikAnimal Done with https://github.com/MusikAnimal/pageviews/releases/tag/2016.09.19T15.48
[15:57:39] Labs, Labs-Infrastructure, Analytics: Report page views for labs instances - https://phabricator.wikimedia.org/T103726#2648753 (Milimetric) This should be done with piwik on labs, now that we have more experience with it.
[16:01:18] Quarry, Analytics: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1195035 (Milimetric) I'm going to untag Analytics, quarry is a different approach, we're about to allow multi-database data access in a different way.
[16:01:25] Quarry: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#2648793 (Milimetric)
[16:36:13] Striker, Phabricator, Security-Reviews, Patch-For-Review: Unable to mirror repository from git.legoktm.com into diffusion - https://phabricator.wikimedia.org/T143969#2648991 (mmodell) >>! In T143969#2647761, @faidon wrote: > Under which account do those git fetches run, and what other privileges...
[17:04:17] Labs, Beta-Cluster-Infrastructure: Please raise quota for deployment-prep - https://phabricator.wikimedia.org/T145611#2635940 (Andrew) This increase sounds fine to me.
[17:04:25] Labs, Beta-Cluster-Infrastructure: Request increased quota for deployment-prep labs project - https://phabricator.wikimedia.org/T145636#2636577 (Andrew) Yep, increase is fine with me.
[17:10:11] Striker: Allow easy replication of existing github/bitbucket repos - https://phabricator.wikimedia.org/T143971#2649139 (mmodell)
[17:10:15] Striker, Phabricator, Security-Reviews, Patch-For-Review: Unable to mirror repository from git.legoktm.com into diffusion - https://phabricator.wikimedia.org/T143969#2649140 (mmodell)
[17:37:24] Labs, PAWS, Research-and-Data: Setup new labsdbs for PAWS / Quarry - https://phabricator.wikimedia.org/T146061#2649322 (yuvipanda)
[17:38:12] Labs, PAWS, Operations, Research-and-Data, hardware-requests: Purchase new labsdbs for PAWS / Quarry - https://phabricator.wikimedia.org/T146061#2649334 (chasemp) p:Triage>Normal
[17:40:39] yuvipanda: Do you happen to know if the Commons database server is unhappy? Some of my queries that used to take less than a minute started to time out
[17:42:42] multichill, can you elaborate?
[17:45:13] Hey jynus, didn't realize it was you. I do some horrible queries to find a bunch of images. For example /data/project/multichill/queries/commons/paintings_without_wikidata_ci.sql
[17:45:26] no
[17:45:40] I mean, what problems are you finding?
[17:45:47] This query now timed out. ERROR 2013 (HY000) at line 6: Lost connection to MySQL server during query after 2 hours
[17:45:54] It used to complete in much shorter times
[17:45:58] yes, long-running queries do that
[17:46:01] Let me dig up the log
[17:46:02] so
[17:46:07] Tool-Labs-tools-Xtools: Convert all xtools issues to either Phabricator or GitHub - https://phabricator.wikimedia.org/T134632#2649396 (Matthewrbowker) I have added the xtools repository to Phabricator. See {rXT}
[17:46:12] the issue is why they are taking so much time
[17:46:45] Labs, Operations, Research-and-Data, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (yuvipanda)
[17:46:59] Labs, Operations, Research-and-Data, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649428 (yuvipanda)
[17:47:01] Labs, PAWS, Operations, Research-and-Data, hardware-requests: Purchase new labsdbs for PAWS / Quarry - https://phabricator.wikimedia.org/T146061#2649430 (yuvipanda)
[17:47:05] Exactly
[17:47:38] This query's real time was either 2 or 3 minutes
[17:48:17] Since a day of 5 that exploded to 116min (KILL)
[17:49:02] jynus: grep -B 5 data/project/multichill/queries/commons/paintings_without_wikidata_ci.txt /data/project/multichill/logs/find_painting_images.log | grep real
[17:49:56] "Since a day of 5" what does that mean?
[17:50:13] jynus: Sorry, since about 5 days ago
[17:50:42] maybe WLM has some stress on commons replica server?
[17:51:05] I see high memory and cpu usage
[17:51:16] 12 runs ago, 2 times a day so something changed about 6 or 7 days ago. Maybe. any graphs that have gone up a lot around that time?
[17:53:42] jynus: 42575053, that's me
[17:54:10] The number of rows is insane
[17:54:52] maybe try splitting the query in smaller chunks
[17:55:02] Labs, Operations, Research-and-Data, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649465 (chasemp) p:Triage>Normal
[17:57:25] The query hasn't been changed since April and would just complete in a normal time. Without knowing the source of the problem it would be a bit pointless
[18:08:28] Labs, Operations, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649537 (ggellerman)
[18:14:48] jynus: The server hosting the Commons database has a high load, is any of the other servers less busy so I can test if it does complete and run in a normal time on that one?
[18:15:25] you can hardcode using enwiki database
[18:15:35] ok for a test
[18:16:03] ok, running now
[18:16:48] Would it be possible that for some reason indexes are not used or something in that direction? I can't use describe because it's a view....
[18:17:39] you should be able to explain the connection
[18:17:52] for a long running query
[18:18:19] http://s.petrunia.net/blog/?p=89
[18:23:16] Oh that's awesome, didn't know that one jynus!
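The blog post jynus links describes MariaDB's `SHOW EXPLAIN FOR <thread_id>`, which returns the execution plan of a statement that is already running, so you can see whether the view is still using the expected indexes even when a plain `DESCRIBE`/`EXPLAIN` of the query text is awkward. A minimal sketch of how a tool account might use it is below; it assumes the usual Tool Labs conventions (credentials in `~/replica.my.cnf`, the `commonswiki.labsdb` alias) and the `pymysql` client library, none of which appear verbatim in the log.

```python
#!/usr/bin/env python3
"""Sketch: EXPLAIN a query that is already running on a labs replica."""
import configparser
import os

import pymysql  # any MySQL/MariaDB client library would work the same way

# Tool Labs convention (assumed here): per-tool credentials in ~/replica.my.cnf
cnf = configparser.ConfigParser()
cnf.read(os.path.expanduser("~/replica.my.cnf"))

conn = pymysql.connect(
    host="commonswiki.labsdb",
    user=cnf["client"]["user"].strip("'"),
    password=cnf["client"]["password"].strip("'"),
    database="commonswiki_p",
)

with conn.cursor() as cur:
    # List this account's long-running queries; 42575053 in the log above was
    # identified this way. Without the PROCESS privilege you only see your own.
    cur.execute(
        "SELECT id, time, LEFT(info, 100) FROM information_schema.processlist "
        "WHERE command = 'Query' AND time > 600 ORDER BY time DESC"
    )
    for thread_id, seconds, snippet in cur.fetchall():
        print(thread_id, seconds, snippet)
        # MariaDB-specific: explain the statement *as it is currently running*,
        # which shows whether the optimizer is still using the expected indexes.
        cur.execute("SHOW EXPLAIN FOR %d" % int(thread_id))
        for row in cur.fetchall():
            print("    ", row)
```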
[18:24:02] I think I am going to do an emergency restart of labsdb1003
[18:24:17] if I do not do it, it will explode and it will be worse
[18:24:31] it is 1 step away from exhausting all memory
[18:25:24] explode?
[18:25:27] duh
[18:25:35] hope that it's not literally
[18:25:58] well, I prefer to restart the server unannounced and be able to start it back
[18:26:11] than it crashing and not being able to start it again
[18:27:00] it is swapping like crazy: https://grafana-admin.wikimedia.org/dashboard/db/mysql?panelId=40&fullscreen&from=1474223207596&to=1474309607597&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1003
[18:27:25] PAWS, Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2101483 (leila) @DarTar, do you want this to happen? If not, we can close it and open it as needed in the future.
[18:34:20] we will suffer some turbulence, please keep your seat belts fastened
[18:37:02] You can always ask reedy to be your co-pilot
[18:46:46] multichill, try now on commons
[18:46:50] it should be much better
[18:47:16] no swapping
[18:49:53] we will see how long it lasts...
[18:51:58] Running
[19:07:02] jynus: Bummer, it's still running. Going to kill it and disable the job for now. No sense in hammering the database servers if no result is produced......
[19:25:11] Labs, Operations, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649945 (chasemp) hi @ggellerman thanks! I believe this has been specially budgeted for in Q2 of 2016 and should work within that budg...
[19:25:56] Tool-Labs-tools-Erwin's-tools: Unknown Error/MySQL errors - https://phabricator.wikimedia.org/T140421#2649959 (Nemo_bis) That's just replag https://tools.wmflabs.org/replag/
[19:37:20] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:39:54] Labs, Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2649993 (chasemp)
[19:39:56] Labs, Beta-Cluster-Infrastructure: Please raise quota for deployment-prep - https://phabricator.wikimedia.org/T145611#2649990 (chasemp) Open>Resolved a:chasemp
[19:43:31] legoktm: start in 15min?
[19:43:43] yuvipanda: sure. I already created the instance
[19:45:10] Labs, Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2650019 (chasemp)
[19:45:12] Labs, Beta-Cluster-Infrastructure: Request increased quota for deployment-prep labs project - https://phabricator.wikimedia.org/T145636#2650016 (chasemp) Open>Resolved a:chasemp should be gtg, there are a few stacked quota bumps for deployment-prep so let me know @fgiunchedi if you get hung u...
[19:45:28] Labs, Beta-Cluster-Infrastructure: Please raise quota for deployment-prep - https://phabricator.wikimedia.org/T145611#2650022 (hashar) New quotas: | Cores | 171/192 | RAM | 350208/392400
[19:45:35] legoktm: ok. did you already set it up with role::puppet::self?
[19:45:38] if not don't do it!
[19:45:42] no
[19:45:47] ok
[19:45:47] just literally created the instance
[19:46:22] ah ok :)
[19:46:25] legoktm: as jessie?
[19:46:30] of course :)
[19:47:03] legoktm: can you apply the role 'role::puppetmaster::standalone'?
[19:47:50] I have to use wikitech for that right?
[19:48:00] legoktm: yup
[19:48:05] horizon will get it in the next few days
[19:49:41] > Modified instance (integration-puppetmaster01).
[19:50:27] yuvipanda: do I need to force a puppet run or anything?
[19:50:44] legoktm: yeah
[19:50:53] or you can wait for the automatic run sometime in next 30min but force :D
[19:52:38] The last Puppet run was at Mon Sep 19 19:48:39 UTC 2016 (3 minutes ago).
[19:52:51] heh
[19:52:57] not sure if that was in time :S
[19:52:59] also
[19:53:00] integration-puppetmaster01 is a Puppet client of integration-puppetmaster.integration.eqiad.wmflabs (puppetclient)
[19:53:03] is that going to cause problems?
[19:53:26] apparently
[19:53:30] I tried to force puppet
[19:53:31] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/etc/puppet/fileserver.conf] is already declared in file /etc/puppet/modules/puppet/manifests/self/config.pp:101; cannot redeclare at /etc/puppet/modules/puppetmaster/manifests/config.pp:26 on node integration-puppetmaster01.integration.eqiad.wmflabs
[19:54:02] that's a different problem, caused by the fact that role::puppet::self is applied to all instances by the hiera config for the project
[19:54:55] puppetmasters don't need to be their own clients, do they?
[19:54:57] legoktm: I just did https://wikitech.wikimedia.org/wiki/Hiera:Integration/host/integration-puppetmaster01
[19:54:59] legoktm: try now?
[19:55:06] can just override the puppetmaster on the specific new puppetmaster instance?
[19:55:20] Krenair: that is irrelevant to this particular issue he's having tho
[19:55:39] yeah I was thinking about the other thing
[19:55:42] which is that the puppetmaster class and the puppet class can't co-exist, because role::puppet::self sets up the whole puppetmaster config and stuff even if it's only a client
[19:55:50] yuvipanda: same error
[19:56:06] legoktm: do a git pull -r origin production on the integration puppetmaster?
[19:56:08] I think those lag behind by up to ten minutes
[19:56:50] uh, where is the repo again?
[19:57:03] /var/lib/...something?
[19:57:42] /var/lib/git/operations/puppet
[19:57:53] uh, /var/lib/git is empty
[19:58:02] root@integration-puppetmaster01:/var/lib/git# ls
[19:58:45] legoktm: ok, I'm gonna poke around for a sec
[19:58:50] go for it
[20:00:32] man, I hate role::puppet::self so much
[20:01:16] legoktm: ok, I'm going to remove role::puppet::self from Hiera:Integration
[20:01:32] how much will that break everything else?
[20:01:34] legoktm: I'm tempted to disable puppet across the instances now
[20:01:50] this doesn't affect contintcloud right?
[20:01:57] nope
[20:01:59] just integration
[20:02:01] ok
[20:02:05] should be fine for now then
[20:02:10] just !log in -releng?
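For reference, the two manual steps discussed above — rebasing the in-project puppetmaster's checkout of operations/puppet and then forcing an agent run instead of waiting for the automatic one roughly every 30 minutes — boil down to two shell commands. The small wrapper below is a hypothetical convenience script, not anything that exists in the puppet repo; only the repository path and the `git pull -r origin production` invocation are taken from the log itself.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper around the manual steps mentioned above."""
import subprocess

PUPPET_REPO = "/var/lib/git/operations/puppet"  # path quoted in the log


def pull_production(repo=PUPPET_REPO):
    # Rebase the local checkout onto the latest production branch, i.e. the
    # `git pull -r origin production` step suggested for the puppetmaster.
    subprocess.run(["git", "pull", "-r", "origin", "production"], cwd=repo, check=True)


def force_puppet_run():
    # `puppet agent --test` triggers an immediate run instead of waiting for
    # the automatic one. With --test, exit code 2 means "changes applied",
    # so only other non-zero codes are treated as failures.
    result = subprocess.run(["sudo", "puppet", "agent", "--test"])
    if result.returncode not in (0, 2):
        raise RuntimeError("puppet run failed (exit code %d)" % result.returncode)


if __name__ == "__main__":
    pull_production()
    force_puppet_run()
```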
[20:02:53] yeah ok
[20:04:15] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:04:53] PROBLEM - Puppet run on tools-puppetmaster-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:08:09] PROBLEM - Puppet run on tools-exec-1206 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:09:41] RECOVERY - Puppet run on tools-docker-builder-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:13:34] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:14:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:14:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:14:36] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:15:18] ^ problem?
[20:15:28] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:15:54] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[20:16:22] yeah looking
[20:17:20] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:17:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:18:06] no idea, it worked on the instance I just tried
[20:18:06] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:18:12] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:18:15] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:19:15] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:19:37] well, I'm in the middle of something else that is also time sensitive (integration switchover), so I'm going to let this fire burn
[20:26:17] yuvipanda: I tried to catch a few and so far they both succeed for me so...idk yet, slow but no errors
[20:26:26] so not sure but seems either not urgent or transient
[20:26:51] yeah
[20:34:12] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:39:35] RECOVERY - Puppet run on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:43:09] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:44:16] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[20:44:54] RECOVERY - Puppet run on tools-puppetmaster-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:53:07] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:53:13] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:53:37] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:55:29] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:55:53] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:56:28] PAWS: python3-tk package missing - https://phabricator.wikimedia.org/T145362#2650344 (Tbayer) Open>Resolved a:yuvipanda >>! In T145362#2629234, @yuvipanda wrote: > Have you tried using `%matplotlib inline` instead of `%matplotlib`? The > former works better in notebooks. Yes, that works, thanks!...
[20:57:19] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:57:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:58:13] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[20:59:04] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:59:06] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:22:58] Labs, Tool-Labs, Collaboration-Team-Triage, Community-Tech-Tool-Labs, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2650437 (Catrope)
[21:54:01] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:54:38] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:54:40] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:54:44] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[21:55:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[21:56:21] ^ all transient
[21:57:37] maybe clients are grouping on the in-project master and causing issues?
[21:58:00] chasemp: no, I merged the change which caused a puppetmaster restart
[21:58:14] so these were the ones that were running when the restart happened
[21:58:21] gotcha
[21:58:48] I'm writing docs now
[21:59:46] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:09:03] RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:14:33] Labs, Operations, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2650663 (ggellerman) @chasemp Hi! This was on the Research & Data workboard. Because it looks like Yuvi is doing the work, we moved i...
[22:29:41] RECOVERY - Puppet run on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:29:41] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:30:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:48:07] Labs, Operations, Research-and-Data-Backlog, hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2650788 (DarTar) @chasemp @ggellerman yes, this is part of dedicated capex budget for FY16-17. If there are separate tickets where app...
[23:09:11] PAWS, Research-and-Data-Backlog: Create a mailing list for PAWS - https://phabricator.wikimedia.org/T129297#2650857 (DarTar) @leila this is now subject to the launch timeline, which is realistically going to be in Q3. I'm fine closing it since we'll need to plan the announcement/support strategy when the...