[00:07:43] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:24:44] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1032.eqiad.wmnet'] `
[00:27:03] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) This gets to the debian loader, and halts on 'Probing EDD' which had no issues on the other hosts. I'm still investigating on what is different...
[00:44:59] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:45:01] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] ` Of which those **FAILED**: ` ['es1032.eqiad.wmnet'] `
[00:45:19] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` es1032.eqiad.wmnet ` The log can be found in `/v...
[00:53:04] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH)
[01:11:39] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es1032.eqiad.wmnet'] ` and were **ALL** successful.
[01:26:02] DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (RobH)
[01:26:16] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (RobH) Open→Resolved >>! In T260370#6598690, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['es1032.eqiad.wmnet'] > ` > >...
[04:15:23] DBA, Operations, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle)
[05:35:37] DBA, DC-Ops, Operations, ops-eqiad: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (Marostegui) Thanks Rob, es1032 looks good now: ` Name :Virtual Disk 0 RAID Level : Primary-1, Secondary-0, RAID Level Qualifier...
[05:37:20] DBA, Operations, ops-eqiad, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[05:37:51] DBA, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Marostegui)
[05:38:13] DBA, Operations, ops-eqiad, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[05:38:15] DBA, decommission-hardware: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Marostegui)
[05:47:31] DBA, Operations, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (nnikkhoui) Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, please feel free to comment/amend however you th...
[05:50:24] DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui)
[05:53:50] DBA: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui)
[06:20:53] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui) On-going transfers: es1011 -> es1026 es1012 -> es1027 es1014 -> es1028 Due to T262388 and how tight we are on this goal, I am directly using netcat to transfer the data.
[06:36:29] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[06:36:45] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui) p:Triage→Medium
[06:37:01] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui) p:Medium→High
[06:43:11] DBA, Operations, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (ArielGlenn) >>! In T267077#6598886, @nnikkhoui wrote: > Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, ple...
[06:43:32] DBA, Data-Services, cloud-services-team (Kanban): Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 (Marostegui)
[06:48:06] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Followup), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[06:48:08] DBA: querycache qc_type and qc_title have different nullabality on s1 only - https://phabricator.wikimedia.org/T265349 (Marostegui) Open→Resolved This is all done
[06:48:11] DBA: querycache qc_type and qc_title have different nullabality on s1 only - https://phabricator.wikimedia.org/T265349 (Marostegui)
[06:57:42] DBA, decommission-hardware, Patch-For-Review: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Marostegui)
[07:03:36] DBA, Patch-For-Review: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 (Marostegui) I have extended the volume on all the eqiad hosts.
` [07:03:11] marostegui@cumin1001:~$ sudo cumin 'es10[26-34].eqiad.wmnet' 'pvs' 9 hosts will be targeted: es[1026-1034].eqiad.wmnet...
[07:51:35] DBA, Operations: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:51:37] DBA, decommission-hardware, Patch-For-Review: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Marostegui)
[07:52:59] DBA, Operations: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[07:53:01] DBA, decommission-hardware, Patch-For-Review: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (Marostegui)
[08:37:09] database backups are back to 3 months retention, now that almost all snapshots have been purged
[08:42:49] DBA, Operations, Performance-Team, Platform Engineering, User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (jcrespo) A reminder that T195578 is waiting for feedback to see if it would be useful to gather query performance statistics.
[08:46:21] DBA, Privacy Engineering, WMF-Legal, Performance Issue, and 2 others: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers) - https://phabricator.wikimedia.org/T195578 (jcrespo) I did a small introduction to Ladsgroup on how to use sys/performance_...
[08:51:51] did any of you check logstash?
[08:52:00] the db dashboards look weird
[08:52:33] they look normal to me
[08:52:41] I have the fatals, queries and db errors constantly open
[08:52:46] I haven't noticed any change on them
[08:52:56] maybe reload on a separate tab? or it could be me
[08:53:55] "Events over time" timeline is showing this to me:
[08:53:55] Weird, I have them refreshing every minute, but I had to refresh the whole page to see that
[08:54:04] ok, so you see it now?
[08:54:19] Yeah, but only on dbquery and on errors, not on fatals
[08:54:22] that view may have been edited
[08:54:25] any upgrade to logstash?
[08:54:53] not yet, I think
[08:55:00] the graph may have been edited
[08:56:32] DBA: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui)
[08:57:18] I am going to apply what I think is a fix
[08:58:09] DBA: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui)
[08:58:41] marostegui: check now?
[08:58:50] checking
[08:59:00] I think either an error or someone edited the "Events over time" visualization
[08:59:07] Looking good now
[08:59:21] But something changed yesterday, no? as friday it looked fine
[08:59:29] (and yesterday I wasn't here, so no idea if it was wrong)
[08:59:36] I would say yesterday was ok
[09:00:17] So what changed on events over time?
[09:00:23] DBA: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui) Open→Resolved I am going to close this as fixed per T265344#6583817
[09:01:57] what changed was that on x-axis, instead of date histogram -> interval Auto, it had "date range"
[09:02:12] so it aggregated all in a single day
[09:02:22] rather than changing based on resolution
[09:03:02] maybe someone created a new dashboard but instead of copying the visualization for editing, it edited it in place
[09:03:18] Do we have logstash blame tool? :)
[09:03:29] sadly I didn't find a history tool
[09:03:46] Thanks for getting it fixed
[09:04:07] sorry I asked but I thought it was, like you, an update issue
[09:09:29] marostegui: you should BTW check https://logstash-next.wikimedia.org/app/dashboards#/view/87348b60-90dd-11e8-8687-73968bebd217
[09:13:06] you mean the new version?
[09:13:16] I've CCed you on the ticket
[09:13:43] which ticket?
[09:14:07] https://phabricator.wikimedia.org/T234854#6439791
[09:14:07] Ah you just cc'ed me
[09:14:11] yeah
[09:14:13] I was checking for emails and it wasn't there yet
[09:14:24] he he
[09:14:35] I get the pop-up if I am online so I get it faster
[09:37:18] BTW, I am using db1133 as a test host for media backups
[10:11:30] DBA, Orchestrator: Support running orchestrator with sqlite backend - https://phabricator.wikimedia.org/T266657 (Kormat) Stalled→Resolved a:Kormat This is done.
[10:12:48] DBA, Operations, Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (Kormat)
[10:12:50] DBA, Operations, Orchestrator, Patch-For-Review: Repackage orchestrator - https://phabricator.wikimedia.org/T266763 (Kormat) Open→Resolved a:Kormat Done and deployed.
[10:13:03] DBA, Orchestrator: Support running orchestrator with sqlite backend - https://phabricator.wikimedia.org/T266657 (Kormat)
[10:13:05] DBA, Operations, Patch-For-Review, User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (Kormat)
[10:13:12] DBA, Operations, Orchestrator: Run orchestrator as non-root - https://phabricator.wikimedia.org/T266656 (Kormat) Open→Resolved a:Kormat
[10:37:47] DBA, Operations, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat)
[10:38:21] DBA, Operations, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) When this is fixed, we also need to clean up the existing db: ` root@db2093.codfw.wmnet[orchestrator]> select * from orchestrator_db_deployments; +------------------+-------...
[11:23:26] This is an example of a good question of "What is a file?": https://test.wikipedia.org/wiki/File:Test_image.jpg
[11:23:39] an image was overwritten, and then reverted
[11:23:58] are those 2 images or 3 images (files)?
[11:24:14] All I care is...that beauty yours??
[11:24:32] not mine, mine is the https://upload.wikimedia.org/wikipedia/test/archive/9/95/20201103112136%21Test_image.jpg
[11:24:37] :-)
[11:24:38] :(
[13:14:02] DBA, Operations, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) Fix deployed, and db cleaned up: ` root@db2093.codfw.wmnet[orchestrator]> delete from orchestrator_db_deployments where deployed_version="" limit 1; Query OK, 1 row affected...
[13:14:14] DBA, Operations, Orchestrator, Patch-For-Review, User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (Kormat)
[13:14:17] DBA, Operations, Orchestrator: Orchestrator binary needs version embedded - https://phabricator.wikimedia.org/T267113 (Kormat) Open→Resolved a:Kormat
[13:33:27] DBA, decommission-hardware, Patch-For-Review: decommission db1091.eqiad.wmnet - https://phabricator.wikimedia.org/T267088 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by lsobanski@cumin1001 for hosts: `db1091.eqiad.wmnet` - db1091.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[14:08:15] DBA, Orchestrator, Patch-For-Review: Auto detect DC on orchestrator UI - https://phabricator.wikimedia.org/T266635 (Kormat) Running this to enact the change in codfw: ` for i in $(mysql.py -h db1115 -A zarcillo -BN -e "select name from instances where server like '%codfw%'"); do echo "====> $i" m...
[14:16:16] DBA, Orchestrator: Auto detect DC on orchestrator UI - https://phabricator.wikimedia.org/T266635 (Kormat) The equivalent for eqiad is now done, too.
[14:17:08] DBA, Orchestrator: Auto detect DC on orchestrator UI - https://phabricator.wikimedia.org/T266635 (Kormat) Open→Resolved a:Kormat DC detection is now working \o/
[14:17:23] DBA, Orchestrator: Auto detect DC on orchestrator UI - https://phabricator.wikimedia.org/T266635 (Marostegui) Excellent!
pc1 and pc2 are now showing correct DCs. db1077 picked the change automatically and pc2 did as well, as I never touched those set of hosts. Thanks!
[14:44:32] DBA, Orchestrator: Investigate moving replicas around with Orchestrator doesn't result on skipped transactions - https://phabricator.wikimedia.org/T267133 (Marostegui)
[14:44:40] DBA, Orchestrator: Investigate moving replicas around with Orchestrator doesn't result on skipped transactions - https://phabricator.wikimedia.org/T267133 (Marostegui) p:Triage→Medium
[15:31:24] DBA, Orchestrator: Investigate moving replicas around with Orchestrator doesn't result on skipped transactions - https://phabricator.wikimedia.org/T267133 (Marostegui) On the host itself: ` Nov 03 14:39:31 db1077 mysqld[31754]: 2020-11-03 14:39:31 203499 [Note] Slave SQL thread exiting, replication stopp...
[15:44:24] DBA: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Urbanecm) Thanks @Marostegui! So, I guess we can now enable that at the wikis excluded at T253802#6536344, and create a third monitoring task?
[15:45:12] DBA: Monitor the growth of CheckUser tables at large wikis - https://phabricator.wikimedia.org/T265344 (Marostegui) Yes, let's do that, I want to make sure `enwiki` and friends are monitored. Thanks!
[15:47:36] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Kormat) s5/codfw is now mostly done: P13146#73111 I've not touched the dbstore hosts, as backups run on tuesdays.
[15:49:41] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (jcrespo) > I've not touched the dbstore hosts, as backups run on tuesdays. I can take care of those when there is no activity, as long as the config is already deployed.
[15:50:40] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Kormat) @jcrespo : perfect. Yep, just needs mariadb restarted to pick it up.
[16:53:42] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Kormat) s1/s2/s3/s5 in codfw are done, excluding the sanitarium masters + sanitariums, and dbstore hosts.
[16:56:49] DBA, Operations: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Cmjohnson)
[16:56:51] DBA, Operations: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[16:56:56] DBA, Operations, ops-eqiad, Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Cmjohnson)
[17:38:18] DBA, Operations, ops-eqiad: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (Cmjohnson) The mainboard arrived
[18:34:10] according to mysql, there are 11025 files on testwiki, according to swift there are 10500 :-/
[19:19:55] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (jcrespo) BTW, check if prometheus exporter daemon needs a shake on restarted host, there are quite a few collection failures showing on grafana (unless it is something else).
[19:56:50] DBA, Orchestrator, Patch-For-Review, User-Kormat: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 (Marostegui) Most likely they need a restart, for those running mariadb 10.4, due to that bug we saw when testing 10.4