[00:42:03] PROBLEM - MariaDB sustained replica lag on db2071 is CRITICAL: 82 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2071&var-port=9104
[00:43:25] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH) [x] db1166-db1176 (exceptions: db117[01]) have all had their default passwords changed to the idrac mgmt password. [] Chris is going to check out db117[01] tomorro...
[00:43:55] RECOVERY - MariaDB sustained replica lag on db2071 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2071&var-port=9104
[00:49:24] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[01:14:52] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[01:40:01] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[01:41:44] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH) John was onsite and fixed db117[01] for me, they are now online. db11[56-65] have had bios and idrac firmware updates, and raid setup. I've updated the task descr...
[01:42:23] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[06:06:16] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui)
[06:06:32] Blocked-on-schema-change, DBA: Increase size of slot_roles.role_id - https://phabricator.wikimedia.org/T270054 (Marostegui) Open→Resolved All done
[06:06:54] DBA: Grant "sockpuppet_import" user INDEX on "sockpuppet" database - https://phabricator.wikimedia.org/T272533 (Marostegui) a:Marostegui
[06:19:44] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) @Cmjohnson unfortunately the server isn't accessible yet - I cannot even reach its idrac :-( ` root@cumin1001:~# ping clouddb1019.eqiad.wmnet -c5 PING clouddb1019.eqiad.wmnet (10.64.48.9) 56(84)...
[06:22:50] DBA, Patch-For-Review: Grant "sockpuppet_import" user INDEX on "sockpuppet" database - https://phabricator.wikimedia.org/T272533 (Marostegui) Open→Resolved This is done ` root@db1107.eqiad.wmnet[(none)]> show grants for `sockpuppet_import`@`10.64.16.19`; show grants for `sockpuppet_import`@`10.64...
[06:39:50] Blocked-on-schema-change: Alter objectcache.exptime - https://phabricator.wikimedia.org/T272512 (Marostegui) p:Triage→Medium @Ladsgroup this table also on parsercache, but it is also empty, so not sure if that needs to be altered too?
[06:40:14] Blocked-on-schema-change: Drop default of oldimage.oi_timestamp - https://phabricator.wikimedia.org/T272511 (Marostegui) p:Triage→Medium
[06:58:08] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui) clouddb1016:3318 and clouddb1020:3318 moved.
[06:58:22] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[07:23:10] Blocked-on-schema-change: Alter objectcache.exptime - https://phabricator.wikimedia.org/T272512 (Marostegui)
[07:23:24] Blocked-on-schema-change, DBA: Alter objectcache.exptime - https://phabricator.wikimedia.org/T272512 (Marostegui)
[07:34:47] Blocked-on-schema-change, DBA: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055 (Marostegui) a:Marostegui
[07:40:00] Blocked-on-schema-change, DBA: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055 (Marostegui) Altered db2114 on s6, leaving it for a few hours to make sure nothing breaks (even if the table is empty there) ` # for i in frwiki jawiki ruwiki; do mysql.py -hdb2114...
[07:44:19] Blocked-on-schema-change, DBA: Alter objectcache.exptime - https://phabricator.wikimedia.org/T272512 (Ladsgroup) >>! In T272512#6764174, @Marostegui wrote: > @Ladsgroup this table also on parsercache, but it is also empty, so not sure if that needs to be altered too? Maybe it should be dropped? but I'm...
[07:50:51] DBA, Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (Marostegui) Reminder: This is happening in around 1h
[07:51:36] Blocked-on-schema-change, DBA: Alter objectcache.exptime - https://phabricator.wikimedia.org/T272512 (Marostegui) >>! In T272512#6764226, @Ladsgroup wrote: >>>! In T272512#6764174, @Marostegui wrote: >> @Ladsgroup this table also on parsercache, but it is also empty, so not sure if that needs to be alter...
[08:43:31] ready to stop bacula daemon when you want it (not jobs running at the moment)
[08:44:20] let's do that at .50?
[08:44:39] akosiaris: will you be around for the etherpad (most likely needed) restart?
[08:44:47] cool
[08:45:01] let me downtime alerts for now
[08:45:18] (bacula ones)
[09:02:02] marostegui: /me around
[09:02:25] akosiaris: all done :)
[09:02:25] see -operations
[09:03:35] ok, good to know. Sorry for being 2 minutes late to the party, babysitting ended later than expected
[09:03:53] np
[09:04:56] akosiaris: no problem at all :)
[09:05:28] akosiaris: I had looked up the history on etherpad1002 just to see if there was something else needed apart from the service restart, which wasn't needed in the end!
[09:07:56] marostegui: Yeah I think that etherpad now will survive quite a bit of time (couple of mins?) without DB access before systemd gives up.
[09:08:17] After that amount of time, I think manual access is required
[09:08:45] "now" is that something you changed or new codebase changes, maybe?
[09:09:05] DBA, Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (Marostegui) This was done. Downtime was from: 09:00:19 to 09:00:48, so 29 seconds of downtime
[09:09:06] akosiaris: yeah, this time it was just 29 seconds of downtime
[09:09:44] ah, you mean for this particular case
[09:09:46] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[09:09:54] I thought that you meant it was something happening recently
[09:09:56] DBA, Wikimedia-Etherpad: Upgrade and restart m1 master (db1080) - https://phabricator.wikimedia.org/T271540 (Marostegui) Open→Resolved Closing this as resolved - thanks everyone for the help! ` root@db1080.eqiad.wmnet[(none)]> select @@report_host; +--------------------+ | @@report_host | +...
[09:11:21] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[09:11:27] 9:11 backup for gerrit finished correctly
[09:11:32] nice
[09:12:11] jynus: systemd restart=Always was "recently" added IIRC
[09:12:19] akosiaris, cool!
[09:30:19] DBA, Orchestrator: Cleanup heartbeat.heartbeat on all production instances - https://phabricator.wikimedia.org/T268336 (Marostegui) m1 table cleaned up
[09:32:31] DBA, Orchestrator: Add m* sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui)
[09:34:43] DBA, Orchestrator: Add m* sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui) p:Triage→Medium
[09:40:48] PROBLEM - MariaDB sustained replica lag on db1111 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104
[09:41:27] DBA: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427 (Marostegui) Added to the deployments calendar
[09:43:02] RECOVERY - MariaDB sustained replica lag on db1111 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104
[09:44:42] that's wikidata, should we worry?
[09:44:50] DBA, wikitech.wikimedia.org, User-notice, cloud-services-team (Kanban): Restart m5 master (db1128) - https://phabricator.wikimedia.org/T272388 (Marostegui) Added to the deployments calendar
[09:45:29] ^hoo
[09:45:50] massive spike of reads from what I can see
[09:46:07] including the master
[09:46:22] hoo https://logstash.wikimedia.org/goto/128b0299f697607ab74ef51ade146539
[09:46:44] why only affecting 1111 then?
[09:46:54] maybe the alert only fired there
[09:46:59] but the spike happened everywhere
[09:47:03] no, check logstash
[09:47:09] almost all error there
[09:47:13] *errors
[09:47:29] that has the higher weight
[09:47:34] ah
[09:47:44] but the graphs show the spikes on all the hosts, so looks like something hit everything at the same time
[09:47:49] including the master too
[09:48:05] I cced hoo for a reason :-)
[09:48:07] might be a massive query, as the traffic spike is on the sent
[09:48:10] The maintenance script rebuilds the entire table (yuck...)
[09:48:16] and the rnd_next handler
[09:48:21] hoo: :-/
[09:48:34] hoo that is ok, but needs more waitForreplicas()
[09:48:42] :-)
[09:49:39] I don't think it was a massive incident, but the fact that affects all servers is worrying
[09:51:07] I will file a ticket
[09:53:26] Thanks :) If we are lucky this "temporary" hacked up thing is not going to live much longer also
[10:01:13] can I quote on ticket for the rebuilding of the entire table?
[10:01:20] *quote you
[10:01:26] ^hoo
[10:04:39] Sure
[10:29:52] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[10:44:25] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[11:35:06] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[11:35:09] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui) clouddb1015:3316 moved - clouddb1019:3316 is down due to HW issues: T272125
[11:35:15] DBA, cloud-services-team (Kanban): Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 (Marostegui)
[11:35:17] DBA, SRE, ops-eqiad: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui)
[13:26:08] DBA, Orchestrator, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui)
[13:26:10] DBA, Orchestrator: Add m* sections to Orchestrator - https://phabricator.wikimedia.org/T272568 (Marostegui)
[13:31:08] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (Marostegui)
[13:31:22] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (Marostegui) p:Triage→Medium
[13:32:30] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (Marostegui)
[13:34:38] Blocked-on-schema-change, DBA: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055 (Marostegui)
[14:12:49] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[14:16:08] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[14:18:53] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[14:20:11] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[14:21:21] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[14:22:54] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[15:19:02] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[15:19:51] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat)
[15:20:00] Blocked-on-schema-change, DBA: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767 (Kormat) All of codfw is now done.
[15:30:52] Blocked-on-schema-change, DBA: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055 (Marostegui)
[15:50:23] Blocked-on-schema-change, DBA: Schema change for timestamp field of uploadstash - https://phabricator.wikimedia.org/T270055 (Marostegui)
[16:46:30] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (Marostegui)
[16:52:43] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (LSobanski) p:Triage→High
[16:53:52] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (WDoranWMF) @Marostegui Just acknowledging that we've seen this and will wait for your input. Let us know if we can help directly with debugging.
[16:55:02] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (Marostegui) ` Jan 21 16:09:57 db2133 mysqld[24583]: 210121 16:09:57 [ERROR] mysqld got signal 11 ; Jan 21 16:09:57 db2133 mysqld[24583]: This could be because you hit a bug. It is also possible that this binary Jan 21 16:09:57 db2133 m...
[17:21:17] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (Marostegui) ` [18346237.601951] mysqld[24681]: segfault at 8 ip 000055dd76adaec6 sp 00007f488ded11e0 error 4 in mysqld[55dd7667c000+8e4000] ` There is not much logged on what caused the crash, apart that it was during the index creati...
[17:24:18] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (gmodena) @Marostegui ack. I'm deeply sorry about this. Did we hit a memory/resource limit in mysql? We did indexed a sister table this morning, which is an order of magnitude bigger, with no issue. We don't expect to touch these inde...
[17:38:23] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (WDoranWMF) @Marostegui Thanks for your help, we'll make sure to coordinate any other changes.
[17:39:48] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (Marostegui) >>! In T272614#6766168, @gmodena wrote: > > From the client side of things (process list & mysql cli), I did realise the query caused a crash: > What did you observe?
[17:45:55] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (jcrespo) >> From the client side of things (process list & mysql cli), I did realise the query caused a crash: >> > > What did you observe? I'm going to guess there is a `n't` missing on the sentence. @gmodena Please don't get stres...
[17:52:02] marostegui: is anything using 'x2' on codfw yet?
[17:52:02] DBA, Patch-For-Review: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 (CDanis) Sorry for the truly baffling error message. The problem turned out to be that section `x2` on codfw did not have `flavor=external` set (it was flavor=regular instead). Now this works: `✔️ cdanis@cumin...
[17:52:13] basically asking if I can commit the dbctl diff I just created ^
[18:02:22] cdanis: Manuel is out for the day. To the best of my knowledge the hosts were blocked from usage by this very problem.
[18:02:36] yeah, was worried it was a bit late for him
[18:02:46] for now I'll undo the diff, but, he should be unblocked now
[18:03:52] Sounds good, thanks.
[18:36:32] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (gmodena) >>! In T272614#6766222, @Marostegui wrote: >>>! In T272614#6766168, @gmodena wrote: >> >> From the client side of things (process list & mysql cli), I did realise the query caused a crash: > What did you observe? ^ typo, sor...
[18:39:29] DBA: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 (gmodena) >>! In T272614#6766232, @jcrespo wrote: >>> From the client side of things (process list & mysql cli), I did realise the query caused a crash: >>> >> >> What did you observe? > > I'm going to guess there is a `n't` missing...
[19:32:06] db1118 seems a bit overloaded, with 25K QPS
[19:33:42] we are serving spikes of 140K QPS just for enwiki currently
[19:34:45] nothing out of the ordinary, I think I am just not normally around during peak hours
[19:36:30] I will create a paste so that people that read me tomorrow know what I am talking about: https://phabricator.wikimedia.org/P13879
[20:32:24] DBA, Analytics-Radar: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112 (jcrespo) dbstore1004 again at 90% memory usage.
[20:44:20] DBA, Phabricator: Restart m3 (phabricator) database master db1132 - https://phabricator.wikimedia.org/T272596 (mmodell) @Marostegui I can do 06:00 UTC, I don't usually go to bed until a bit later than that. I guess any day is fine, I think 06:00-07:00 UTC is actually the best time that would overlap both...
[23:07:14] DBA, Phabricator, SRE: Grant phstats user SELECT rights to phstats user - https://phabricator.wikimedia.org/T272654 (Urbanecm)
[23:08:24] DBA, Phabricator, SRE: Grant phstats user SELECT rights to phstats user - https://phabricator.wikimedia.org/T272654 (Urbanecm)
[23:11:00] DBA, Phabricator, SRE, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Urbanecm)
[23:50:33] DBA, DC-Ops, SRE, ops-eqiad: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)