[00:16:08] fixing puppet on icinga server, follow-up to an earlier change with notes_links/notes_urls
[00:37:42] fixed. this applied a lot of changes coming from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509365/ now and also that other change that gives Icinga a user agent
[00:40:32] they had just never been applied because of the puppet fail. icinga working. going off now
[00:42:15] arr.. or not, because now that this works i get to see another error that comes from my own merge re: gerrit contact groups, heh
[00:52:02] ok, done. Total Errors/Warnings: 0 in icinga config again. now ok
[00:53:21] off
[07:54:08] FYI, I'll reboot cumin2001 in a bit, please don't use it for new screen sessions/reimages for now
[08:04:24] cumin2001 is back, please use it for reimages etc. for now (cumin1001 is up next)
[08:04:42] moritzm: when do you plan to do cumin1001?
[08:05:32] I'll start hunting down screen sessions etc. in a bit, but if anyone is long-running and can't be skipped, we'll just re-attempt a different time
[08:05:42] there's one current reimage running
[08:06:48] ok, I just finished a long-running one, so fine from my side as of today
[08:25:52] cumin1001 is rebooted and good to use again
[08:31:26] thanks
[08:53:06] any objections against a reboot of deploy1001 in the next 10 minutes? there are no deployments scheduled today, but maybe there's something currently needing a DB change in wmf-config or similar?
[08:53:32] moritzm: ok from the DB side
[08:56:06] thanks, I'll proceed in a few minutes, then
[09:10:43] deploy1001 is back up
[09:30:31] FYI, I'll reboot netmon1002 in a bit, speak up if it's a bad time
[09:38:14] it includes netbox (JFYI)
[09:41:43] it's back up already :-)
[11:49:10] moritzm: so keyholder on deploy1001 says only the analytics key is armed
[11:49:39] just to confirm, it was not deployed: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php&1
[11:49:48] the keyholder is re-armed, but then it was asking me for the modsec one we don't have
[11:50:02] I assumed arming the first would arm all except the modsec one
[11:50:22] eh
[11:50:26] it does them in alphabetic order
[11:50:29] let me hack it
[11:50:56] from https://phabricator.wikimedia.org/T224887#5232493
[11:52:27] ok done, jynus retry please
[11:52:39] doing
[11:53:02] I've temporarily moved private and public key in /etc/keyholder.d from apache2modsec to zapache2modsec
[11:53:03] it is very slow
[11:53:45] 11:53:26 Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.10, After: 2.00, Threshold: 1.00)
[11:54:45] I don't see an increased error rate, though
[11:55:13] interesting, scap canaries are not those defined in A:mw-canaries
[11:55:33] so I've deployed scap to: mw[1261-1265].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet (9 hosts)
[11:55:41] s/deployed/upgraded/
[11:56:07] but when scap deploys it actually checks the error rates on a different set of hosts
[11:56:24] I don't know if they were supposed to be in sync or not
[11:56:40] yeah, in this context mw canaries are the hosts with the app server canary role
[11:56:47] 11:53:26 Canary error check failed for 1 canaries, less than threshold to halt deployment (2/11), see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details. Continuing...
[11:56:51] as those are usually used to stage new extensions, HHVM etc.
[12:01:31] jynus: any error related to the hosts I listed above?
[12:01:51] no
[12:02:21] ack, thanks
[12:07:47] I'm going to lunch but ping if needed
[12:43:48] jynus: I'm wondering if the error you got earlier was due to:
[12:43:49] https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/519074
[12:43:53] that was included in this release too
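
Note on the keyholder workaround discussed at 11:50-11:53: the rename from apache2modsec to zapache2modsec only helps because the keys are armed in alphabetic order. The following is a rough sketch of that effect, not keyholder's actual code. It assumes the arming loop stops at the first key whose passphrase is not available; the key names `deploy_example` and `mwdeploy_example` are hypothetical placeholders, and `armed_before_blocking` is an illustrative helper, not a real keyholder function.

```python
def armed_before_blocking(key_names, unavailable):
    """Return the keys armed before an alphabetical loop stalls.

    Assumption (not confirmed by the log): arming walks the keys in
    sorted order and stops at the first key with no known passphrase.
    """
    armed = []
    for name in sorted(key_names):
        if name in unavailable:
            break  # the operator cannot answer this prompt, so arming stops here
        armed.append(name)
    return armed


# Hypothetical key set: only "analytics" and "apache2modsec" appear in the log.
before = ["analytics", "apache2modsec", "deploy_example", "mwdeploy_example"]
after = ["analytics", "zapache2modsec", "deploy_example", "mwdeploy_example"]

print(armed_before_blocking(before, {"apache2modsec"}))
print(armed_before_blocking(after, {"zapache2modsec"}))
```

Under these assumptions, the first call arms only the analytics key, which matches what was reported at 11:49:10, while the renamed key sorts last and no longer blocks the others.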