[06:30:25] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) I have rebooted this host to see if there were any HW errors on boot-up, but it came back fine, no storage, memory or any other kind of error reported.
[06:31:11] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[06:42:53] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) This host also crashed a bit over a year ago: T173365 Even if I didn't find any trace of a real storage crash, this is what syslog shows 10 minutes before the crash: ` Nov...
[07:12:12] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1078.eqiad.wmnet'] `
[07:12:22] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[07:12:24] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1078.eqiad.wmnet'] `
[07:12:44] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[07:12:47] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1078.eqiad.wmnet'] `
[07:13:02] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` db1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201...
[07:31:28] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1078.eqiad.wmnet'] ` and were **ALL** successful.
[08:35:05] marostegui: on db2095 I drop the user_options column again, to see if it breaks. I checked, its master (db2076) doesn't have it, so it should be safe
[08:35:20] ok
[08:39:01] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) I try to drop the column again on db2095. As db2076 (master of db2095) doesn't have that column, it should be safe.
[08:51:17] it broke again!
[08:51:23] I check the binlog what is happening there
[08:51:31] and why this could happen
[08:51:43] ok
[08:51:44] (the *why* part is the important one)
[08:52:54] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) it broke again. I start investigating the binlog from db2076
[09:09:10] exit
[09:09:16] nothere
[09:14:13] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek) @Bstorm I am available, sorry for the late answer I was ooo
[09:34:40] DBA, Operations, ops-eqiad: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (Marostegui)
[09:34:49] DBA, Operations, ops-eqiad: Upgrade firmware on db1078 - https://phabricator.wikimedia.org/T209815 (Marostegui) p:Triage>Normal
[09:35:53] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) It's weird. The binlog event which can't be executed on db2095 is the following: `### UPDATE `frwiki`.`user` ### WHERE...
[09:36:15] DBA, MediaWiki-extensions-FlaggedRevs, Wikimedia-Site-requests, User-Zoranzoki21: Drop FlaggedRevs tables in database for srwikinews - https://phabricator.wikimedia.org/T209761 (jcrespo) @Zoranzoki21 just to be 100% sure everybody is aware of the consequences, dropping the tables means if at a la...
[09:39:11] brb ~10min
[09:40:09] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Marostegui) The column you are altering is part of the triggers. ` Statement: SET NEW.user_password = '', NEW.user_n...
[09:49:16] DBA, foundation.wikimedia.org: Drop the petition_data table from production - https://phabricator.wikimedia.org/T208979 (Marostegui) a:Marostegui Table dropped from `testwiki` as it was empty. I have also renamed it on `foundationwiki` db1078 to make sure nothing reads from it. I will leave it like...
[09:49:34] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) a:Marostegui
[09:59:31] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) a:Banyek
[10:02:27] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) >>! In T85757#4757356, @Marostegui wrote: I'll also need to edit the file https://gerrit.wikimedia.org/r/plugins/gitile...
[10:17:54] I do the trigger recreation now
[10:26:01] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) The trigger maintenance will be executed on db2095 as: `SET SESSION sql_log_bin=0; -- DROP TRIGGER IF EXISTS frwiki.use...
[10:42:21] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek) Worked, the replication is not broken on db2095 anymore. S6 codfw is done.
[10:47:52] DBA, Operations, Patch-For-Review: db1078 (s3 candidate master) crashed - https://phabricator.wikimedia.org/T209754 (Marostegui) Open>Resolved db1078 is now fully repooled after cloning it. This is all done. As a follow up with DCOps I have created {T209815} so we can have everything up to da...
[10:59:38] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change, User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (Banyek)
[11:05:14] jynus: meeting?
[12:29:11] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero) >>! In T209517#4753108, @aborrero wrote: > > Will this new schedule work? > * labsdb1011.eqiad.wmnet scheduled 2018-11-19 13:00 UTC without announcement > * labsdb1...
[14:01:24] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek) Sorry I missed this, for some reason there was a different time in my mind, so this is entirely my fault. What shall we do now?
[14:10:44] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek) can we do the labsdb1011.eqiad.wmnet today at a later time? The others I put into my calendar to avoid missing those
[14:52:03] DBA, Cloud-Services: Prepare and check storage layer for shnwiki - https://phabricator.wikimedia.org/T206916 (Banyek) I did the grant as ` set session sql_log_bin=0; create database shnwiki_p; Query OK, 0 rows affected (0.00 sec) Query OK, 1 row affected (0.00 sec) MariaDB [(none)]> set session sql_lo...
[14:52:57] DBA, Cloud-Services: Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 (Banyek) I GRANT'ed: `MariaDB [(none)]> set session sql_log_bin=0; create database yuewiktionary_p; Query OK, 0 rows affected (0.01 sec) Query OK, 1 row affected (0.00 sec) MariaDB [(none)...
[15:03:50] DBA, User-Banyek: Checking archive tables across the databases - https://phabricator.wikimedia.org/T209048 (Banyek) >>! In T209048#4753191, @jcrespo wrote: > You only checked 2 servers on each comparison- you should check all of them- it takes approximately the same amount of time, speed it up with `--st...
[15:14:49] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Bstorm) @Banyek I think as long as it works for you, and they are all on different days, it's fine for the wiki replicas. For labsdb1004 and 1005, 11/20 @ 17:15 for labsdb1...
[15:18:19] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek) yes, those dates are good for me, I'll put that to the calendar. On the missed labsdb 1011 I propose 2018-11-21 13:00 UTC then
[16:22:48] DBA, Operations, ops-codfw: Decommission parsercache hosts: pc2006 pc2007 pc2008 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[16:23:03] DBA, Operations, ops-codfw: Decommission parsercache hosts: pc2006 pc2007 pc2008 - https://phabricator.wikimedia.org/T209858 (Marostegui) p:Triage>Normal
[16:23:51] DBA, Operations, ops-codfw: Decommission parsercache hosts: pc2006 pc2007 pc2008 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[16:28:53] DBA, Operations, ops-codfw: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[16:33:20] DBA, Operations, ops-codfw: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[16:34:11] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero) >>! In T209517#4758290, @Banyek wrote: > On the missed labsdb 1011 I propose 2018-11-21 13:00 UTC then This works for me. Could we do both labsdb1011 and labsdb1010...
[16:34:40] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero)
[16:35:29] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Banyek) >>! In T209517#4758640, @aborrero wrote: >>>! In T209517#4758290, @Banyek wrote: >> On the missed labsdb 1011 I propose 2018-11-21 13:00 UTC then > > This works for me...
[16:41:44] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero) >>! In T209517#4758642, @Banyek wrote: >>>! In T209517#4758640, @aborrero wrote: >>>>! In T209517#4758290, @Banyek wrote: >>> On the missed labsdb 1011 I propose 2018...
[16:48:54] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Bstorm)
[16:51:37] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Marostegui) >>! In T209517#4758674, @aborrero wrote: >>>! In T209517#4758642, @Banyek wrote: >>>>! In T209517#4758640, @aborrero wrote: >>>>>! In T209517#4758290, @Banyek wrot...
[16:52:52] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero) >>! In T209517#4758712, @Marostegui wrote: >>>! In T209517#4758674, @aborrero wrote: >> The scheduling we are talking about is 2018-11-21 13:00 UTC for labsdb1010/10...
[16:53:18] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (aborrero)
[16:53:30] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Marostegui) Thanks for the understanding :-)
[16:57:41] DBA, Data-Services, cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (Bstorm)
[17:49:45] DBA, Operations, decommission, ops-codfw: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[17:58:22] I leave as I bring my son to the doctor
[18:39:03] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) In theory we could fix this with the upgrade to 2.16 (as nothing uses the db anymore but it's still...
[18:39:14] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox)
[19:57:14] DBA, Cloud-Services, cloud-services-team (Kanban): Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (Bstorm)
[19:57:55] DBA, Cloud-Services, cloud-services-team (Kanban): Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (Bstorm) a:Bstorm
[20:28:10] DBA, MediaWiki-extensions-WikibaseMediaInfo, SDC Engineering, StructuredDataOnCommons, Wikidata: MediaInfo extension should not use the wb_terms table - https://phabricator.wikimedia.org/T208330 (Jdforrester-WMF) I've briefly looked into our code around this, and I'm afraid I don't see it as...
[20:57:47] DBA, JADE, Operations, TechCom-RFC, Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight)
[20:59:17] DBA, Gerrit, Operations, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (Paladox) Gerrit's db support is being removed in https://gerrit-review.googlesource.com/c/gerrit/+/205196 :)
[21:16:15] DBA, JADE, Operations, Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (Marostegui) >>! In T196547#4748654, @awight wrote: > This was addressed for now, by an agreement between our team and SRE to not install JADE on wikis with revision t...
[21:19:27] DBA, JADE, Operations, Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (awight)
[21:20:20] DBA, JADE, Operations, Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (awight) >>! In T196547#4759786, @Marostegui wrote: > There are some other big wikis (commons) where this is also a concern and some other agreements were made in orde...
[21:23:15] Blocked-on-schema-change, DBA, Anti-Harassment (AHT Sprint 33), Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (TBolliger)