[04:51:17] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) >>! In T249188#6095476, @Bstorm wrote: > `lang=shell-session > root@labsdb1011:~# hpssacli controller all show config detail | grep Firm > Firmware Version: 7.00 > ` > It's...
[05:05:12] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (Marostegui) You can use db1077 for this
[05:08:17] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:08:22] DBA, Cognate, ContentTranslation, Growth-Team, and 10 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (Marostegui) Open→Resolved This was done. The server was unaccessible from 05:00:41 to 05:02:02
[05:08:54] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[05:15:18] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (Marostegui)
[05:30:28] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (Marostegui)
[05:31:24] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) It keeps crashing
[06:38:46] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (jcrespo) If you accept some input, making `partman/custom/no-srv-format.cfg` a recipe that works but doesn't touch the /srv lvm partition would solve most of our problems (combined with the dyna...
[06:51:15] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (Marostegui)
[06:53:52] Amir1: The wb_terms table cannot be removed from the analytics dbstore, no?
[06:54:07] It can be removed from labs, but also from dbstore?
[06:55:15] I checked your comment on Monday, and you confirm it can be deleted from labs, but not from dbstore
[06:58:41] I am starting a dump on labsdb1009, just in case
[07:00:17] I can stop it if it causes load problems
[07:01:00] cool
[07:01:04] thanks
[07:01:40] it is on a screen, locally to labsdb1009, in case I am not around
[07:01:45] marostegui: it can be dropped for dbstore. No researcher uses it
[07:02:29] excellent!
[07:02:36] thank you
[07:04:55] Thank you for saving the databases. I just sent a couple of emails
[07:08:46] 10 minutes after labsdb backup started and it is still only getting metadata
[07:08:50] :-/
[07:09:04] jynus: :_(
[07:15:01] DBA, Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2089.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200430...
[07:29:16] 2020-04-30 06:58:44 [WARNING] - Broken table detected, please review: commonswiki_p.wb_terms_no_longer_updated
[07:29:33] and so with other 3
[07:29:37] yes, we need to recreate the views once we've dropped everything
[07:29:42] I see
[07:29:50] so they are just views that point to nothing, right?
[07:29:54] yeah
[07:29:56] ok
[07:30:00] I thought it was something wose
[07:30:02] *worse
[07:30:08] they need to be created for commonswiki, testcommonswiki, testwikidatawiki, and wikidatawiki (not yet)
[07:30:14] yeah, no issue
[07:30:16] So I am waiting for it before going for all the wikis
[07:30:35] "Broken table detected" was a bit scary msg
[07:30:37] :-D
[07:31:05] haha
[07:32:44] it is dumping now, it may take 12+ hours
[07:35:17] DBA, Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2089.codfw.wmnet'] ` and were **ALL** successful.
[07:35:51] lag is growing, however
[07:35:59] is that ok?
[07:36:12] I think we'll need to live with it
[07:39:34] DBA, Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (Marostegui)
[09:10:34] DBA, Operations: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (Marostegui)
[09:10:44] DBA, Operations: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (Marostegui)
[09:15:41] DBA, Operations: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui)
[09:16:02] DBA, Operations: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui) p:Triage→Medium
[09:16:25] DBA, Operations: Upgrade and restart s4 (commonswiki) primary database master: Tue 12th May - https://phabricator.wikimedia.org/T251502 (Marostegui)
[09:22:03] DBA, Operations: Upgrade and restart s5 and s6 primary DB master: Tue 5th May - https://phabricator.wikimedia.org/T251154 (Marostegui) Day before: - Install the 10.1.43-2 package on both masters Maintenance day: - Silence all hosts in s5 and s6 - Set read only on s5 and s6: ` dbctl --scope eqiad section...
[09:35:06] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (Kormat) @jcrespo : i'm happy to work on that, but i'd like to do the proposed change in this task first. partman is voodoo anytime i've touched it, so it will take some time and some care to cha...
[09:38:37] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (jcrespo) I see now, sorry, I didn't understood the proposed scope of work first time I read it. +10000 for me.
[09:40:05] marostegui: db1077 - i see it's in `test-s4`, and has a bunch of dbs on it, but no replication. i don't intend to touch `/srv`, but what's the impact if i wipe it by mistake?
[09:41:53] kormat: should be fine - it was a test host that was used for a refactoring query testing. we recently got in contact with the users and they gave us green light to repurpose the hosts. we just left db1077 for the sake of leaving it. its BBU is broken so it cannot go into production
[09:42:10] define:BBU?
[09:42:21] Battery-backup unit
[09:42:25] ahh, ok.
[09:42:30] the battery that powers the raid
[09:42:30] for the raid controller, i assume
[09:42:33] gotcha
[09:42:34] yep
[09:42:57] ok, i'll go nuke it a bit :)
[09:43:12] I am going to guess we are quite more "full stack" here than in other places :-D
[09:43:38] the host has notifications disabled, but double check that
[09:43:38] from replacing BBUs to creating frameworks, we have to do it all :-DDDD
[09:43:46] marostegui: i did :)
[09:44:05] if you also want to downtime it for a week that won't hurt :-)
[09:45:35] `icinga-downtime -h db1077 -d 604800 -r "DB reimage testing T251392"` done
[09:45:36] T251392: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392
[09:46:12] \o/
[09:49:37] marostegui: https://gerrit.wikimedia.org/r/c/operations/puppet/+/593471 for you
[09:50:22] good luck :p
[09:51:15] on the good news, labsdb1009 lag may be recovering
[09:51:44] dewiki and commons wiki dumping now
[09:57:15] hum. i'm unable to authenticate to the mgmt interface on db1077. it doesn't like the management password
[09:59:16] let me try, but it is highly likely it got desynced
[10:00:13] ah, it's a hp, not a dell. i don't know if that makes a difference
[10:00:19] is it still `root`?
[10:00:22] no, it shouldn't
[10:00:23] yeah
[10:00:38] You can check: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands
[10:00:45] This is always good for IPMI troubleshooting
[10:00:49] kormat: it worked for me
[10:00:53] but took some time
[10:01:01] net or other issues, maybe
[10:01:21] works for me too
[10:01:28] so just to be clear: `ssh root@db1077.mgmt.eqiad.wmnet`
[10:01:35] yeah, some of those servers had switch issues
[10:01:37] yes
[10:01:42] try again after meeting :-D
[11:16:00] labsdb1009 dump around 1/3 of the way. dewiki, commons dumped, enwiki ongoing
[11:31:54] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui)
[11:34:04] DBA, Cognate, ContentTranslation, Growth-Team, and 9 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (Johan)
[11:38:24] for some reason mgmt access worked on like the 8th try. v0v
[11:39:41] `textcons` complains that a license is necessary
[11:41:23] oh, but `vsp` works. i assume that's sufficient.
[11:43:19] I believe we do have the licenses, so that's an error
[11:43:48] vsp is the serial port (ttyS1, which we open a getty on), textcons is the VGA console (tty0, also has a getty)
[11:43:52] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db1077.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202004301143_...
[11:44:26] paravoid: it looks like the bios is configured to use the serial port, too, so probably `vsp` is sufficient for my needs
[11:45:27] yeah, bios redirection + grub + kernel + getty should all be configured
[11:45:45] grand. thanks :)
[11:45:50] the latter three are configuration-managed so it should be consistent
[11:46:04] the bios settings are still manual, so more error-prone
[11:46:04] paravoid: see, i told you you were the more technical engineering director :)
[12:05:04] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1077.eqiad.wmnet'] ` and were **ALL** successful.
[12:14:44] DBA, Operations: Make enabling reimaging for db hosts more humane - https://phabricator.wikimedia.org/T251392 (Kormat) Success: using `db1007) ;; \` in `netboot.cfg` achieved the (short-term) goal of allowing us to use manual partitioning.
[12:33:31] DBA, Dumps-Generation, MediaWiki-extensions-CodeReview, Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (ArielGlenn) labstore1006.wikimedia.org and labstore1007.wikimedia.org in /srv/dumps/xmldatadumps/public/other let's make the subdirectory co...
[12:59:20] DBA, Patch-For-Review, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['labsdb1011.eqiad.wmnet'] ` The log can be found in `/v...
[13:23:05] DBA, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labsdb1011.eqiad.wmnet'] ` and were **ALL** successful.
[13:53:24] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (Marostegui)
[14:32:22] DBA, Dumps-Generation, MediaWiki-extensions-CodeReview, Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (jcrespo) a:jcrespo→ArielGlenn ` root@labstore1006:/srv/dumps/xmldatadumps/public/other$ ls -lR codereview/ codereview/: total 4 drwxr-...
[16:02:19] interesting: https://jira.mariadb.org/browse/MDEV-12289
[16:12:44] if my calculations are correct, we are dropping 15TB from production databases (not to mention backups, etc.)
[16:43:17] DBA, Privacy Engineering, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (JFishback_WMF) Hi @jcrespo the #security-team reviewed this table and we're fine with making the data public with the caveat that anything where `aft_hide = 1` is **removed** from th...
[16:45:04] DBA, Privacy Engineering, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (JFishback_WMF) a:JFishback_WMF→None
[16:56:20] DBA, Wikidata, Wikidata-Campsite, Wikimedia-Rdbms, and 3 others: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 (brennen) Adding DBA in case this reflects load balancing issues.
[17:05:59] DBA, Wikidata, Wikidata-Campsite, Wikimedia-Rdbms, and 3 others: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 (Marostegui) It is just a big transaction that takes more than the limit we have for writes, which is 3...
[17:11:34] DBA, Wikidata, Wikidata-Campsite, Wikimedia-Rdbms, and 3 others: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 (brennen) p:Unbreak!→High Ok, per @Marostegui and conversation in RelEng I'm removing this as a...
[17:11:42] DBA, Wikidata, Wikidata-Campsite, Wikimedia-Rdbms, and 3 others: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 (brennen)
[17:13:08] DBA, Wikidata, Wikidata-Campsite, Wikimedia-Rdbms, and 3 others: LoadBalancer: Transaction spent [n] second(s) in writes, exceeding the limit of [n] - https://phabricator.wikimedia.org/T251457 (Marostegui) This is preventing writes from happening with that particular action,so it must indeed be f...
[18:25:53] DBA, Wikidata, Performance Issue, User-Addshore, Wikimedia-production-error: Repeated WMFTimeoutException in Wikidata - https://phabricator.wikimedia.org/T250115 (Addshore) Open→Resolved a:Addshore This was due to a DB mistakenly being depooled. On Monday the intention was 1 dB wo...