[04:23:05] 10DBA: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10Marostegui)
[04:41:16] 10DBA: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10Marostegui) HW logs: ` /system1/log1/record38 Targets Properties number=38 severity=Critical date=05/12/2020 time=01:50:28 description=Uncorrectable Machine Check Exception (Processor 2, APIC ID...
[04:45:50] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) The import has finished - I am going to stop MySQL and copy the content to backup1001 and then I'll attempt to configure replication.
[04:51:36] 10DBA, 10Data-Services: database query timeouts since April 26th 2020 - https://phabricator.wikimedia.org/T252209 (10Marostegui) 05Open→03Declined Declining as nothing can be done at the moment to mitigate the current timeouts. We have finished reimporting all the data into labsdb1011, if all goes well, we...
[05:07:01] 10DBA, 10Operations, 10User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (10Marostegui) 05Open→03Resolved This has been done. RO starts: 05:00:54 RO stops: 05:03:40
[05:07:04] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[05:07:38] 10DBA, 10Operations, 10User-notice: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (10Marostegui)
[05:07:56] 10DBA, 10Operations, 10Puppet, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui)
[05:36:41] 10DBA: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10jcrespo) The same thing happened at T225378#5303653 ` root@db2097:~$ date Tue May 12 05:36:21 UTC 2020 root@db2097:~$ free -m total used free shared buff/cache available Mem:...
[05:41:43] 10DBA: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10jcrespo) ` 462 - Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 4). The DIMM is mapped out and is currently not available. Action: Take corrective action for the failing DIMM. Re-map all DIMMs back in...
[07:12:39] there is some ongoing lagging on some multiinstance s6 hosts since yesterday
[08:11:22] db1113 lagging for 5 minutes already
[08:12:31] ok I will take a look, I am busy with clinic duty stuff
[08:13:03] could be HTMLCacheUpdateJob::invalidateTitles ?
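(Editor's note: for context, a quick way to confirm what the dashboards show on a multi-instance replica like db1113 is to ask the s6 instance directly. This is a generic sketch, not a command taken from the log; the socket path assumes the usual one-socket-per-section layout.)

```bash
# Minimal sketch, assuming the s6 instance listens on its own socket
# (path is an assumption); prints the replication delay and current position.
sudo mysql -S /run/mysqld/mysqld.s6.sock \
  -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Seconds_Behind_Master|Relay_Master_Log_File|Exec_Master_Log_Pos'
```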
[08:15:41] big spike on writes on all the sections
[08:15:45] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=7&fullscreen&orgId=1
[08:16:05] I think it is mostly s6
[08:16:15] see https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1113&var-port=13316&from=1589270050436&to=1589271318426&panelId=6&fullscreen
[08:16:38] lots of inserts in a single transaction
[08:16:55] https://grafana.wikimedia.org/d/kUxjfbeWk/mysql-kormat?panelId=2&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1131&var-port=9104
[08:17:08] right after a deploy
[08:17:33] not hw, the other instance on the same host did not lag
[08:18:06] although it is true there is weird traffic patterns
[08:18:19] 17:04 brennen@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31
[08:18:25] right after that the master spiked on writes
[08:18:44] but that is yesterday, right?
[08:18:54] yes, but the master is still having lots more INSERTs
[08:18:58] I see
[08:19:06] https://grafana.wikimedia.org/d/kUxjfbeWk/mysql-kormat?panelId=2&fullscreen&orgId=1&from=now-2d&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1131&var-port=9104
[08:19:16] I am going to create a task for that version
[08:19:22] Will capture a bit of the traffic on the master
[08:19:34] but it is very suspicious it started right after the deployment
[08:19:35] I think it is the invalidate stuff
[08:19:43] I mentioned before
[08:19:45] have you checked binlogs?
[08:19:48] no
[08:19:53] let me take a look
[08:19:56] it is a guess
[08:20:03] although it could be a symptom, not a cause
[08:20:14] marostegui: https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?panelId=6&fullscreen&orgId=1&from=1589269504265&to=1589271510297&var-dc=eqiad%20prometheus%2Fops
[08:20:19] if you filter out the labs hosts
[08:20:39] it points to an app issue (all hosts affected at the same time)
[08:20:59] from 7:57
[08:22:07] On the master i see most of the writes like INSERT /* Wikimedia\DependencyStore\SqlModuleDependencyStore::storeMulti */ INTO `module_deps` (md_module,md_skin,md_deps)
[08:22:12] oh
[08:22:25] that was a previous issue
[08:22:30] let me find it
[08:22:46] I am not seeing that huge amount on previous binlogs (from before the release)
[08:22:59] was that happening now or yesterday?
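(Editor's note: the binlog check being discussed here can be approximated with something like the sketch below. The binlog file name, path and start time are placeholders, and it assumes the query text is visible in the events, as in the INSERT quoted above.)

```bash
# Rough sketch: count which tables the recent INSERTs on the master target,
# to confirm module_deps dominates. File name and datetime are illustrative.
sudo mysqlbinlog --start-datetime="2020-05-12 07:50:00" \
    /srv/sqldata/db1131-bin.000123 \
  | grep -oE 'INSERT [^(]* INTO `[a-z_]+`' \
  | awk '{print $NF}' | sort | uniq -c | sort -rn | head
```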
[08:23:07] https://phabricator.wikimedia.org/T247028
[08:23:10] This is affecting everything https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104
[08:23:12] That's enwiki master
[08:23:15] same huge increase
[08:23:28] see the task
[08:23:31] Let me comment on that task yeah
[08:23:42] we ended up lowering importance
[08:23:48] because back then it didn't cause lag
[08:23:52] I will raise it again
[08:23:55] so it was not ideal but manageable
[08:24:07] now it is causing lag, so it should be high
[08:24:32] ping krinke, as he was around last time, he may want to know
[08:24:48] yeah, he's subscribed
[08:25:56] "The table stores absolute paths which means when a wmf-branch roll over, it loses track of some files, thus causing a needless cache invalidation"
[08:26:08] that seems to be fixed
[08:26:19] but confirms there is a relation between the invalidation and that table
[08:41:18] 10DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10Marostegui)
[08:41:45] 10DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10Marostegui) p:05Triage→03Medium
[09:18:11] 10DBA, 10Operations, 10ops-codfw: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10Marostegui) This host is under warranty from what I can see, so maybe we should get a new memory DIMM from HP? That is what we did when it happened at T225378
[09:21:29] ^man that is the plan :-D but I want to solve the sw issue first
[09:21:38] in particular, the failed dumps
[09:22:01] I am worried about the 15% less data on dumps, that smells corruption on backup taking
[09:47:28] eh, ok, sorry for posting
[09:48:09] oh, don't be sorry
[09:48:14] it was just a FYI
[09:48:23] that I am on top of it :-D
[09:49:48] ok thanks
[09:50:44] man == manuel, not Man!!! :-D
[10:25:35] marostegui: tendril shows that pc1 currently has 4 hosts, but both pc2 and pc3 only have 2 each
[10:26:03] that seems odd
[10:27:05] ohh. i'm misreading my own notes
[10:27:34] we have 3 sections, 1 machine per DC per section
[10:27:37] plus some spares
[10:30:34] T252182 says pc2010 has been upgraded, but it has not
[10:30:35] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182
[10:37:02] kormat: yeah, looks like I didn't upgrade pc2010 even though I thought I did
[10:37:56] should i do that before the production machines?
[10:38:23] yeah please
[10:40:10] grand, on it.
[10:40:46] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10Kormat) Correction: pc2010 was not upgraded.
[10:40:50] thank you
[10:40:56] np :)
[10:42:19] marostegui: as i'm not (yet) touching the master, this is just the normal reimage procedure, right?
[10:42:33] for pc2010 yep
[10:42:34] (i know that touching the master requires changing the mw config, and reordering replication)
[10:42:36] and for pc2007 also
[10:42:47] no need to re-order replication for this "quick" task
[10:42:58] but we will get there once you touch pc1007 or pc1009
[10:43:06] if pc2XXX that means it is not active master
[10:43:06] ack :)
[10:43:14] the ones that indeed require MW depooling are pc1XXX
[10:43:22] * kormat nods
[10:46:33] marostegui: for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/595901
[10:59:32] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc2010.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202005121...
[12:12:49] hurm. puppet agent is erroring out after reimage, because it can't connect to mariadb ☹️
[12:13:06] ah yes, because of pt-heartbeat
[12:13:07] right?
[12:13:11] yep
[12:13:29] so, mount /srv and all that, start mysql, run puppet, stop mysql and let the installation finish :)
[12:13:51] i.. wow.
[12:13:51] if you don't do that, it will finally fail and say that the installation never finished correctly, but it is a lie :)
[12:13:54] it actually did
[12:13:59] it is just a race condition
[12:14:10] you can leave it fail too, it doesn't matter at that point
[12:14:18] it's a shit condition. :)
[12:14:45] yeah, for those that run pt-heartbeat yes
[12:15:18] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc2010.codfw.wmnet'] ` and were **ALL** successful.
[12:17:54] do we have a task for this?
[12:18:34] about that race condition?
[12:18:37] no that I know of
[12:18:59] ok, i'll file one for tracking purposes
[12:19:08] but yeah, when mysql is stopped, pt-kill and pt-heartbeat will fail
[12:19:18] and as they are supposed to be started by puppet..
[12:19:44] why does puppet manage services? isn't that systemd's job?
[12:20:14] yeah, sorry, for pt-kill it is not
[12:20:26] I don't remember what we did actually
[12:20:28] I can check later
[12:20:42] ok :) it's not important
[12:20:51] you can check on labsdb1009
[12:20:57] those are the only hosts that run pt kill
[12:23:51] 10DBA, 10Operations: wmf-auto-reinstall fails on hosts that run pt-heartbeat - https://phabricator.wikimedia.org/T252528 (10Kormat)
[12:23:54] ^ created
[12:24:07] thanks - can you move it into the backlog column too?
[12:26:27] 10DBA, 10Operations: wmf-auto-reinstall fails on hosts that run pt-heartbeat - https://phabricator.wikimedia.org/T252528 (10Marostegui) p:05Triage→03Medium This is "expected" as the host doesn't have MySQL up and running. This is pretty much the last step of the script, so even if the installation reports...
[12:30:32] done
[12:31:52] thanks :*
[12:41:39] `REVOKE DELETE HISTORY ON *.* from 'dump'@'X.X.X.X';` - is that a literal command to run? or does the ip need to be specified?
[12:41:59] (mm. we probably don't bother backing up pc, do we?)
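(Editor's note: stepping back to the pt-heartbeat race discussed a few lines up - the workaround marostegui describes there amounts to something like the sketch below. Command names are the generic Debian/Puppet ones, not necessarily the exact local wrappers, and the unit name and fstab entry are assumptions.)

```bash
# Minimal sketch, assuming /srv is in fstab and the standard mariadb unit:
# bring mysql up just long enough for the post-reimage puppet run to succeed.
sudo mount /srv
sudo systemctl start mariadb
sudo puppet agent --test      # pt-heartbeat can now connect, so puppet is happy
sudo systemctl stop mariadb   # let the rest of the installation finish
```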
[12:42:09] First check if the user exists, which I doubt on the parsercache hosts
[12:42:43] you can do a quick check with select user,host from mysql.user where user like 'dump';
[12:43:00] thanks <3
[12:43:03] nope, doesn't exist
[12:43:09] (on pc2010)
[12:43:25] cool, we don't backup pc of course :)
[12:53:12] marostegui: it looks like it's going to take a Long time for pc2010 to catch back up with replication - is that expected?
[12:53:16] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=pc2010&var-port=9104&from=now-1h&to=now
[12:53:56] hmm. i'd estimate that it's i/o bound writing to disk
[12:54:41] yes, it is expected
[12:54:47] it usually takes a bit
[12:54:59] alright. i'll continue with a production host in the meantime then?
[12:56:30] which one?
[12:56:48] i was going to close my eyes and just stab my finger at tendril
[12:56:54] but, i guess, pc2007?
[12:56:54] :)
[12:57:02] cool, sounds good
[12:57:13] keep in mind that as it is codfw, it doesn't receive any traffic
[12:57:59] we should make a 'lonely' puppet role for these poor hosts
[13:00:59] marostegui: this (how long it will take for pc2010 to resync) is a very clear reason for why we disable notifications for reimaging db hosts. i'll bear that in mind for future discussions with volans ;)
[13:01:33] yeah, in this case pc2010 wouldn't page (because it is codfw), but with any other eqiad host, it would, that's why :)
[13:01:45] right, good point :)
[13:04:06] kormat: although that's debatable
[13:04:15] * kormat dons his debating gloves
[13:04:35] a host that is up and running and just re-syncing, should be healthy apart for 1~2 checks
[13:04:54] volans: yeah, but replication would still page
[13:04:55] right now pc2010 has only a Lag red alert in icinga
[13:05:28] so would be better to have all checks running as normal, not downtimed, not silenced and just the expected ones (lag) that clearly are not ready downtimed/disabled
[13:05:30] marostegui: https://gerrit.wikimedia.org/r/c/operations/puppet/+/595922 for you (i used a separate line for pc2007 in the hopes that i'll still be able to do an auto git revert for the pc2010 change)
[13:05:31] volans: I think the point is that the host might get out of downtime if for any reason the catching up takes long
[13:05:55] marostegui: sure, I'm saying that we should have something dedicated in addition to the simple short downtime for the reimage
[13:06:00] that covers only the bits we're interested
[13:06:34] for example, wouldn't you want to know if RAID fails on that host during the resync before putting it back in production? :D
[13:07:46] what I'm trying to say, is that after a reimage a host should be healthy, with maybe some service and related check not yet happy (and that's ok and we should aim at those instead of the whole host)
[13:10:59] volans: are you saying that wmf-auto-reimage could have per-service knowledge about what to downtime after a host gets reimaged?
[13:11:38] host comes up -> downtime everything until after initial puppet runs -> downtime only services that might need some time (maybe a few hours) to recover?
[13:12:21] kormat: depends, wmf-auto-reimage already downtimes the whole host for 2h (IIRC), given it's not the one taking care of any post-install action
[13:12:41] I'd say that is more the post-install action that should be able to downtime what it needs
[13:12:51] like the resync downtiming the lag for examaple
[13:12:54] *example
[13:13:21] the only case not covered in this case is a reimage where the post-actions are not started within 2h from the reimage
[13:13:38] that seems unlikely, but YMMV
[13:14:17] volans: so wmf-auto-reimage doesn't undowntime when it finishes currently?
[13:14:23] nope
[13:14:30] huh
[13:14:45] that sounds like something you're arguing against above, too
[13:15:04] it might in the future once converted to cookbook (the helper functions for that are in spicerack)
[13:15:12] but most likely will do that only if it's all green
[13:15:39] (we have something to check that, not yet in prod)
[13:17:02] the reason why it's not doing that was lack of support for checking state and downtime removal. don't forget that wmf-auto-reimage is "legacy" as in pre-spicerack
[13:17:09] gotcha
[13:17:27] well, that sounds like something our Future And Amazing Automation could work with
[13:18:07] ™
[13:18:08] :D
[13:29:19] marostegui: i have some heresy for you: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595928
[13:29:35] (rather than making a CR per host that gets switched to buster, i've made one just switching all of the pc* hosts to buster)
[13:30:11] +1 fwiw :
[13:30:13] :)
[13:31:39] merging quickly before marostegui changes his mind :)
[13:32:28] xddddd
[13:32:36] I did +1
[13:39:21] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc2007.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202005121...
[13:52:23] marostegui: er. pc2007 has no link
[13:53:17] woot?
[13:53:20] for pxe?
[13:53:23] yep
[13:53:57] maybe reboot back with disk and check if the MAC that shows on the dhcp file is the one that asks for link
[13:54:38] i can confirm that eno1 has the same mac address as is in the dhcpd config
[13:55:22] and the same that asks for PXE?
[13:55:37] that i'm trying to figure out how to check
[13:56:38] you can check the dhcp logs
[13:56:56] i'm on install2003 - is that the right server?
[13:57:05] yeah, or install1003
[13:57:13] they are the dhcp/tftp
[13:57:21] a codfw host could end up talking to install1003?
[13:57:29] yes, whichever responds faster
[13:57:33] ohh. i see.
[13:59:05] `May 12 13:43:45 install2003 dhcpd[2808]: DHCPACK on 10.192.0.104 to d0:94:66:86:2e:4d via 10.192.0.3`
[13:59:14] that says the right mac address got the right ip..
[14:00:17] yeah, that looks like the one on the dhcpd file yeah
[14:00:34] so maybe you can open a ticket to DCOPs and CC netops to see if they can give you a hint
[14:00:44] that would be: netops tag and ops-codfw tag
[14:02:33] there's something weird going on here. i can bring the interface up manually, and it does have an ip configured on it
[14:03:56] maybe the switch is doing something funky?
[14:03:57] aaand now it's all mysteriously working
[14:04:07] loose cable?
[14:05:19] i dunno, it feels like an issue with d-i
[14:05:30] (but maybe that's because the more i learn about d-i the less i trust the fucker :)
[14:05:53] but this isn't the first pc2XXX we reimage
[14:08:30] the install is running fine. i'll see if i can have a look through the logs when it comes back up
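(Editor's note: the cross-check done above - does the MAC in the dhcpd config match the MAC that actually requested an address - can be scripted roughly as below. The paths are the stock isc-dhcp-server ones and are assumptions; only the MAC comes from the log.)

```bash
# Sketch: confirm the NIC's MAC appears both in the dhcpd host entries and in
# the DHCP exchange logged on the install server. Paths may differ locally.
mac="d0:94:66:86:2e:4d"
grep -ri "$mac" /etc/dhcp/                     # host entry in the dhcpd config
grep -i  "$mac" /var/log/syslog | grep -E 'DHCPDISCOVER|DHCPOFFER|DHCPACK' | tail
```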
[14:24:22] hmm. well, it _was_. then it hung. it turns out it's trying to install stretch. but also it looks like there's a bug in the preseed config we use:
[14:24:37] `/var/lib/dpkg/info/network-preseed.postinst: eval: line 1: syntax error: unexpected ";;"`
[14:25:42] agh. that's my fault. i have a typo in my netboot.cfg change.
[14:28:03] and mine for not catching it
[14:29:47] kormat: wanna know a fun fact about that typo?
[14:30:36] marostegui: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595945
[14:30:40] volans: go on :)
[14:31:10] that breaks the reimage from your host and all the following in the case/esac list, but works fine for all the ones defined before the typo
[14:31:37] volans: oh ouch
[14:31:48] joys of bash :D
[14:32:10] the sooner we remove humans from editing that the better :)
[14:32:35] yeah
[14:47:22] lag on pc2010, I am guessing expected
[14:47:31] yes
[14:47:43] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc2007.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202005121...
[14:47:54] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['pc2007.codfw.wmnet'] `
[14:48:33] `14:47:39 | pc2007.codfw.wmnet | Unable to run wmf-auto-reimage-host: Failed to icinga_downtime`
[14:48:42] kormat: use --no-downtime
[14:48:55] cheers
[14:49:16] 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc2007.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202005121...
[14:51:18] kormat: pro-tip: for codfw hosts is quicker to launch it from cumin2001 ;)
[14:52:18] how much quicker are we talking?
[14:53:23] my guess is not that much, just few round trips, but the whole first puppet run has to go cross-dc
[14:53:29] (the log of it)
[14:53:39] I would do it more for reliability too
[14:54:27] ok, added to notes. *sobs*
[14:55:14] it is not the first time that we run a 4-hour alter table and in the middle of it a host it loses connection for a second
[14:55:33] since then I tend to run them -if long- from localhost
[14:56:08] don't think just network- e.g. "we need to reboot cumin" or whatever
"we need to reboot cumin" or whatever [14:58:41] I am going offline, I started at 6.30am [14:58:44] More than enough [14:59:10] have a nice day [14:59:13] labsdb1011 -> backup1001 transfer still ongoing [14:59:15] 2TB or so pending [14:59:17] ok [14:59:24] will not touch it [14:59:30] have my own list of things to fix :-D [14:59:46] tomorrow I will do the start+mysql_upgrade+replication [15:00:05] * jynus crosses fingers [15:00:20] * marostegui too and looks at kormat's [15:00:43] * kormat is worried marostegui might be hungry, and hides his fingers [15:00:57] XDD [15:01:03] I am actually quite hungry [15:01:14] well, in that case, [15:01:19] * kormat gives marostegui the finger [15:01:26] (not sure if that needs translation :) [15:02:09] XDDDD [15:02:12] no need to [15:04:31] lol [15:25:02] 10DBA: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc2007.codfw.wmnet'] ` and were **ALL** successful. [17:24:42] 10DBA, 10Operations, 10ops-codfw: db2097 (backup source) restarted itself - https://phabricator.wikimedia.org/T252492 (10jcrespo) I resetup the host from backups. I am going to generate a logical backup (and a snapshot will be also generated later this day) and then send this to dc ops. [19:35:57] 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) The transfer has finished. I have started mysql on labsdb1011 and ran a mysql_upgrade --force as it wasn't run before. No errors were shown during startup or during t... [20:21:24] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata-Campsite, and 4 others: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10Addshore) [20:22:42] 10DBA, 10WMDE-Analytics-Engineering, 10Wikidata, 10Wikidata-Campsite, and 4 others: [Story] Monitor size of some Wikidata database tables - https://phabricator.wikimedia.org/T68025 (10Addshore)