[04:13:57] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) And the host finally crashed: ` May 27 02:11:02 labsdb1011 mysqld[23527]: 2020-05-27 02:11:02 0x7fc65c207700 InnoDB: Assertion fai...
[04:18:16] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) @Kormat @jcrespo db1141 has caught up with replication, let's get mysqldump from it and attempt a restart? I am thinking to reimag...
[04:24:02] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui)
[04:24:15] Blocked-on-schema-change, DBA: Apply Babel schema change expanding babel_lang in Wikimedia production - https://phabricator.wikimedia.org/T253342 (Marostegui) Open→Resolved All done
[05:13:38] DBA: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 (Marostegui) Open→Resolved a:Marostegui So this has been addressed by adding 2 events that purge `global_status_log_5m` and `global_status_log` tables and k...
[05:13:40] DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui)
[05:59:19] it seems toolsdb OOM'd (clouddb1001.clouddb-services.eqiad.wmflabs)
[05:59:48] was totally frozen, had to reboot it by hand
[06:00:34] marostegui: I may need some help to start mysql again
[06:09:45] I created /var/run/mysqld/ by hand
[06:13:50] mmm in the slave DB I see some ERROR log entries mentioning slave replication:
[06:13:52] https://www.irccloud.com/pastebin/HGOVyPcG/
[06:15:24] perhaps I should do what it suggests: `START SLAVE;` or something
[06:18:00] the secondary being `clouddb1002.clouddb-services.eqiad.wmflabs` in this case
[06:21:54] several tables are marked as crashed in the primary DB and should be repaired -- per reports in the logs
[06:33:42] arturo: I will be with you in 10 minutes
[06:33:54] walking back home after running
[06:41:03] arturo: back
[06:42:10] so clouddb1001 is the master and it crashed, right?
[06:43:00] ah these are the old toolsdb hosts?
[06:43:15] I am seeing the replication filters on clouddb1002
[06:43:27] and those look like the old toolsdb hosts where we didn't replicate certain users
[06:47:49] I am fixing the dup entry errors
[06:57:59] arturo: I think it is all fixed now and replication is flowing again
[07:55:31] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Nikerabbit)
[07:58:11] marostegui: Announced https://www.wikidata.org/wiki/Wikidata:Project_chat#Special:FewestRevisions_to_be_disabled
[08:18:58] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) Hi, @wiki_willy I just want to ping you so your team is aware that the maintenance here didn't complete correctly and that we need more onsite help (I don't need this fast, jus...
[08:22:49] thanks marostegui !!
[09:01:58] arturo: there are more issues, I am going to try to fix those too
[09:13:29] what issues? can I help with anything?
[09:14:41] arturo: more replication issues, looks like stuff got corrupted on the master crash
[09:15:38] arturo: I am not sure it is worth the time fixing them, at least on toolsdb consistency wasn't always warranted
[09:27:44] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Taking a `mydumper` from db1141 using: ` root@db1141:/srv# /usr/bin/mydumper --compress --events --triggers --routines --logfile /s...
[09:30:07] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (jcrespo) > host localhost Will it fit?
[09:34:28] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Yep, it has 2.1T available and the dump on labsdb1009 is around 880GB
[09:40:56] DBA: Package transferpy framework under wmfmariadbpy - https://phabricator.wikimedia.org/T253736 (Privacybatm)
[09:41:42] DBA: Package transferpy framework under wmfmariadbpy - https://phabricator.wikimedia.org/T253736 (Privacybatm) p:Triage→Medium
[09:41:55] shall we open a phab task marostegui ?
[09:42:21] arturo: up to you, but I don't have much time to deal with this unfortunately
[09:42:38] that's OK. Anyway, if we are going to let some data die, we should probably create a phab task and send an email to the users
[09:42:53] arturo: So far I haven't, I have fixed all the stuff without losing data
[09:43:16] so is just the replication?
[09:44:24] yes, only happens on the slave
[09:44:27] which I guess it is not used
[09:44:34] or at least on toolsdb it was just kept for "backups"
[09:44:47] although we warned users that they should keep their own backups too
[09:45:00] that was years ago, not sure how it is now since it was moved under your belt :)
[09:45:32] probably the same ^^U
[09:46:07] would fixing replication for good involve cloning the data, or doing a sync from scratch or something?
[09:46:24] not necessarily
[09:46:32] and definitely not all the data probably
[09:46:39] just a few tables, which maybe they are not even worth
[09:46:44] but that's not for me to make the call
[09:46:50] so far replication is working again
[09:46:58] I have fixed the inconsistencies
[09:49:56] ok, so now I'm not sure how to describe the current status (for the phab task I'm filing)
[09:51:29] T253738
[09:51:29] T253738: ToolsDB: master crashed, replica having consistency issues - https://phabricator.wikimedia.org/T253738
[09:59:00] commented, and now getting into a meeting
[10:01:28] thanks!
[10:53:29] it would be interesting to try to understand why the server crashed in the first place. Can mariadb eat all the memory just like that?
[11:07:27] oom are logged by kernel, you should check if that happened
[11:28:15] I created a doc for tendril, I have shared it with you
[11:28:23] thanks
[11:28:52] I don't want to monopolize it, but I also know you are busy
[11:28:59] and I volunteered to do it
[11:29:09] so I will work on it
[11:29:33] I started with what I think is the things we agree on 0:-D
[11:40:12] i'm looking at Orchestrator a bit
[11:40:27] we should just get that deployed I think, even in a very limited/read-only mode at first
[11:54:42] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Nikerabbit) @Ladsgroup I think what you wrote above should be in the task description as well, especially the part that it is not supported by all D...
[11:55:45] I will add mark to the doc too
[11:58:16] i already like orchestrator because it's written in go :)
[11:59:27] i just said that to manuel ;)
[11:59:47] I think it is early to say
[12:00:16] we should explain what the actual problem is and more than 1 option with is advantages and disadvantages
[12:00:23] *its
[12:00:48] or a short term and a long term plan/options
[12:02:55] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Nikerabbit)
[12:43:13] DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (Kormat) db2138 is set up, and currently catching up on replication before it gets pooled into s2+s4.
[13:16:20] Amir1: Sorry I missed your message, I was buried into lots of stuff
[13:16:45] Amir1: I have read it now, thank you :)
[13:17:15] no worries. I think a patch is already there, you can just merge it
[13:17:18] let me check
[13:17:41] Amir1: the funny thing is that this is hitting us RIGHT NOW again
[13:17:44] I am going to kill it again
[13:17:59] ugh
[13:18:04] thanks
[13:18:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/596172
[13:18:37] also I think you need to manually remove the cron as well (because this patch doesn't remove it, we need to make it absent first)
[13:20:10] Is that ready to be merged?
[13:22:48] Yes
[13:22:53] ok, going to merge
[13:23:01] the announcement is done
[13:23:15] you also need to manually remove the cron too :(
[13:23:24] yep
[13:23:57] I cannot submit, maybe apergos has to
[14:16:55] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (Marostegui)
[14:18:18] i am enjoying jynus's opening comment on T224589: "This should be easy ..."
[14:18:20] T224589: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589
[14:18:43] haha another case of famous last words!
[14:18:57] we are guilty of that around here
[14:19:18] and what's worse - these words persist as a function of us doing our job ;)
[14:21:35] kormat: last time the biggest issue was apache mod rewrite changes
[14:21:51] so this time it was supposed to be easy (puppet wise)
[14:21:54] and it was
[14:22:02] the application not starting on the other side...
[14:25:16] :)
[14:42:47] I'm about to do a dbctl release
[14:42:53] starting with cumin2001
[14:43:09] and then will briefly introduce a diff to test the new diffing
[14:43:45] cdanis: i got `colordiff` and `icdiff` installed on all hosts earlier, which should help showcase your improvements :)
[14:43:55] ahah
[14:44:05] yeah I just noticed icdiff was already installed :)
[14:46:09] btw kormat, did I ever show you https://gerrit.wikimedia.org/r/c/operations/puppet/+/557085 ?
[14:46:31] my heart-rate always goes up when you ask me these questions. looking :P
[14:47:30] oh wow
[14:49:53] kormat: https://phabricator.wikimedia.org/P11320
[14:50:07] we don't just use icdiff because ... well, it's not packaged as a python package
[14:50:08] sigh
[14:50:33] but this is better at least.
[14:50:54] great :)
[14:51:28] (i don't think icdiff would be a good default. having an actual apply-able diff is a useful attribute of the current system)
[14:52:14] well, it is the default on the console if it was installed via pip or similar ;) but that isn't true in practice here. but also, the export to phabricator when you commit is always in unified format, not icdiff
[14:52:48] ic
[14:55:12] ic wydt
[14:56:09] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (wiki_willy) Thanks @jcrespo . I don't think @Jclark-ctr has been onsite at the data center since the last update, but I'll follow up with him on this when he's out there this week. T...
[14:57:24] o/ just wondering if it was known why 1087:9104 was laggy this morning?
[14:57:44] oh, unless that was that special page again
[15:14:28] s8?
[15:18:59] oh kormat I forgot to mention it in the changelog but this release also includes https://gerrit.wikimedia.org/r/c/operations/software/conftool/+/597318
[15:19:37] ah y
[15:27:40] addshore: our friend the cronjob
[15:27:48] I killed it and removed it
[17:40:14] DBA: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (ArielGlenn) I have the recipe dumpsdata100X-no-data-format.cfg which does less than it should (but at least doesn't format the array). I'd love a fully functional solution.
[20:34:47] DBA, Operations, ops-eqiad: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui)
[20:37:13] DBA, Operations, ops-eqiad: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (CDanis) p:Triage→High
[20:39:36] DBA, Operations, ops-eqiad: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) I am running a compare from this host to its candidate master (db1081) to make sure we are good for Friday.
[20:41:23] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui)
[20:41:58] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Krinkle) >>! In T119173#6154249, @Ladsgroup wrote: > - Not all DBMSes support ENUM, for example sqlite that we officially support doesn't have enum...
[20:57:22] btw added to dbctl docs: https://wikitech.wikimedia.org/w/index.php?title=Dbctl&type=revision&diff=1867593&oldid=1851699
[21:04:58] I think I added a script for that
[21:05:01] to our repo
[21:05:04] will check tomorrow
[21:05:12] I should be sleeping
[21:05:19] yes, go off
[21:12:59] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (wiki_willy) a:Jclark-ctr
[21:15:15] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (wiki_willy) @Marostegui - will do, Papaul and John are working on pulling the TSR right now for the RMA. Thanks, Willy
[23:31:13] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Jclark-ctr) Sent TSR report to Dell Confirmed: Service Request 1025886499 was successfully submitted.