[04:26:58] DBA: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 (Marostegui)
[04:40:57] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1135.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006290440_marostegui_2786...
[04:46:47] DBA, Operations, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) p:Triage→Medium @herron any idea how big these DBs can be and how many writes we'd be expecting? Which grants would be needed? I would assume we do need backups, r...
[04:55:10] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Marostegui) @dpifke does this work as expected?
[04:55:12] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Marostegui) p:Triage→Medium
[05:00:03] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1135.eqiad.wmnet'] ` and were **ALL** successful.
[07:08:52] DBA, Operations, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (jcrespo) > mailman3web will have the emails That is more concerning, not because it is not doable, but because with attachments, the other database storing organization's emails on a...
[07:11:28] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[07:44:46] lag spike monitoring is merged (again), hopefully this time without a typo that causes massive spam.
[07:47:41] DBA, Patch-For-Review, User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (Kormat) Open→Stalled The alert is now active, stalling this until we have some actionable feedback about how to tune it.
[07:47:43] DBA, observability, Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (Kormat)
[07:48:34] DBA, Patch-For-Review, User-Kormat: Create prometheus alert to detect lag spikes - https://phabricator.wikimedia.org/T253120 (Marostegui) Also: we need to upgrade or create a new section with how to proceed if this alert fires up
[07:49:08] marostegui: i thought you already did that ;)
[07:49:48] No, I started and left it as a draft :)
[07:49:56] ah haha
[08:03:35] DBA: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 (Marostegui)
[08:10:13] DBA, User-Kormat: Create testing environment for db automation - https://phabricator.wikimedia.org/T256602 (Kormat)
[08:10:16] what do you think of https://gerrit.wikimedia.org/r/c/operations/puppet/+/608053 ?
[08:10:40] jynus, marostegui: please have a read of https://phabricator.wikimedia.org/T256602, i'd like to discuss it in our meeting today
[08:11:01] jynus: deploying the package?
[08:11:08] yes
[08:11:47] I think that is a good idea
[08:12:04] I am asking if you are confident about deploying the current version
[08:12:08] I think we also discussed it for switchover.py but not a priority
[08:12:10] jynus: ah
[08:12:19] I have tested it with 3 transfer for now
[08:12:22] and they worked fine
[08:12:40] so can I deploy it?
[08:13:12] let me +1 it to be clear!
[08:14:07] so I was asking if you saw a blocker/danger
[08:14:29] but also if +1ed to use the new version
[08:15:40] DBA, User-Kormat: Create testing environment for db automation - https://phabricator.wikimedia.org/T256602 (Marostegui) I like the idea of having a "pre" testing environment.
We usually test with codfw (when we are relatively confident) but I can see a benefit of having a "pre" codfw testing environment...
[08:15:53] jynus: nope, so far everything (basic transfers) have worked solidly
[08:16:10] jynus: are backups using it or still the old version?
[08:16:19] I think that's our better testing grid, all the backups transfers I think
[08:16:30] they are using the puppet version
[08:16:38] that is why I wanted to deploy it, so it uses the new one
[08:16:46] but I tested on production too a couple of times
[08:17:21] let's go ahead then!
[08:17:49] note the main user-level difference is the port: defaults to 4400 and auto-detects it
[08:18:11] yeah, but that's really transparent for the final end user
[08:18:32] on further releases we will do more user impacting changes
[08:23:07] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[08:25:39] DBA: transferpy package does not depend on python3-yaml - https://phabricator.wikimedia.org/T256604 (jcrespo)
[08:30:37] x1 snapshots on backup*002 worked well over the weekend
[08:30:45] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[08:31:06] jynus: good news
[08:35:41] narrator: it was not good news
[08:39:42] marostegui: there appear to be 2 cases where the alert gets NaN: section master _in primary dc_, and 'standalone' instances
[08:40:04] ah, because show slave status is empty
[08:40:06] no?
[08:40:10] yep
[08:40:17] standalone are easy to fix
[08:40:38] do you know how to check in puppet if a given host is in the primary dc?
[08:40:51] so the primary masters shouldn't be empty, but we disconnect replication from codfw to avoid issues when doing maintenance on codfw
[08:41:02] kormat: the hosts on their yaml would have the master role
[08:41:14] cat hieradata/hosts/db1083.yaml
[08:41:14] # db1083
[08:41:14] mariadb::shard: 's1'
[08:41:14] mariadb::mysql_role: 'master'
[08:41:14] mariadb::binlog_format: 'STATEMENT'
[08:41:36] marostegui: that's not sufficient. e.g in `s8`, both `db1109` and `db2079` have role `master`
[08:41:47] but the latter also needs to be monitored for replication lag
[08:42:14] $mw_primary on hosts store which is the primary dc
[08:42:39] (for mw, misc hosts are a different story)
[08:43:32] so a combination of that role + $mw_primary as jaime mentions?
[08:43:59] that sounds good
[08:44:00] $mysql_role is called
[08:57:09] jynus: can i get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/608012 please? rebasing it is painful as there are almost always conflicts :)
[09:01:50] DBA, Patch-For-Review, User-Kormat: Create reuse recipes for tendril/zarcillo/dbprov/backup hosts - https://phabricator.wikimedia.org/T255768 (Kormat) As @jcrespo pointed out - we cannot currently test the dbprov reuse recipe, but the next time we're installing a new dbprov host we should test it then.
[09:24:27] is ! a valid puppet operator?
[09:25:13] it is, I am mixing my languages
[09:41:31] DBA: transferpy package does not depend on python3-yaml - https://phabricator.wikimedia.org/T256604 (Privacybatm) I have added manually those dependencies!
[09:44:45] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui)
[11:08:07] DBA, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, and 2 others: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 (Marostegui) s2 eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004...
[11:21:51] jynus: > I don't like the idea of storing a lot of meaningful data on a VM for practical reasons (backups, recovery, ...).
[11:22:02] Let me tell how it stores passwords :D
[12:17:46] DBA, Patch-For-Review, User-Kormat: Create reuse recipes for tendril/zarcillo/dbprov/backup hosts - https://phabricator.wikimedia.org/T255768 (Kormat) Open→Resolved Summary: - tendril/zarcillo: not currently feasible, and not worth the effort repartitioning them to make it feasible. - dbprov:...
[12:17:48] DBA, Patch-For-Review: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (Kormat)
[12:23:26] DBA, Patch-For-Review: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2096.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[12:44:04] kormat: marostegui: is it known that a lot of the "5-minute average replication lag is over 2s" alerts are evaluating as NaN?
[12:44:23] Yes, kormat is WIP on those
[12:44:28] ah ok
[12:44:32] no, i'm waiting on review :P
[12:44:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/608271 *hint* ;)
[12:44:53] oh
[12:45:01] Missed the email
[12:45:04] Sorry
[12:45:05] checking
[12:45:19] on a scale of one to 10, how much i believe you: NaN
[12:45:48] Yeah, I guess I need to disable the rule that sends all the emails from you to trash directly
[12:45:55] :)
[12:49:15] merged, running puppet on all the affected hosts currently
[12:51:24] DBA, Patch-For-Review: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2096.codfw.wmnet'] ` and were **ALL** successful.
[12:53:00] fixed \o/
[12:53:15] \o/
[14:10:08] DBA, Patch-For-Review: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (Marostegui)
[14:10:58] DBA, Patch-For-Review: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (Marostegui) Pending: schedule x1 master switchover
[14:27:33] !bash kormat: on a scale of one to 10, how much i believe you: NaN
[14:27:33] Amir1: Stored quip at https://bash.toolforge.org/quip/XKF5AHMBj_Bg1xd33aex
[14:27:49] :D
[14:28:10] marostegui: have you seen the drift reports?
[14:28:23] Amir1: hahah
[14:30:20] hahaha
[14:30:24] Amir1: no, where is it?
[14:30:43] here :D I think I sent it in Friday, let me check
[14:30:59] sorry for spam
[14:31:01] marostegui: 1100 drifts.
Mostly the MCR stuff
[14:31:01] 5:19 PM I try to make it foldable so we ignore those for now
[14:31:01] 6:14 PM → hashar joined ⇐ batm quit
[14:31:01] 7:24 PM Amir Sarabadani marostegui: These are the drifts excluding MCR ones: https://phabricator.wikimedia.org/P11667
[14:31:01] 7:24 PM (in total, around 100-ish)
[14:31:02] 7:24 PM This is all of them: https://phabricator.wikimedia.org/P11668
[14:31:02] 7:27 PM btw MCR schema changes caused around 10% in size reduction in s6: https://grafana.wikimedia.org/d/000000377/host-overview?panelId=28&fullscreen&orgId=1&var-server=db1131&var-datasource=thanos&var-cluster=mysql&from=1592944447567&to=1593103141444
[14:31:03] 7:27 PM is s1 and s8 it'll be massive
[14:31:15] Amir1: oh thanks, I saw the notification and then I forgot to check it when I got home
[14:31:44] nah, it was late (It was early for me :D)
[14:31:54] yeah, in s6 the backups reduced like 30% I think jaime said
[14:32:00] for that table
[14:32:27] Amir1: ok, so we can create some tasks for those drifts
[14:33:06] I'm so excited now
[14:33:36] marostegui: do you want me to create them? I'm causing you work so let me do parts of it
[14:33:49] Amir1: if you have time, yeah please :)
[14:35:01] Awesome, I have something to procrastinate
[14:35:09] haha
[15:13:56] I hate when I run into things that affect 10.1 but are fixed on 10.4 so they affect us partially
[15:13:57] grrr
[16:30:15] DBA: Switchover s8 primary database master db1109 -> db1104 - Date TBD - https://phabricator.wikimedia.org/T239238 (Marostegui)
[16:30:19] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[18:03:44] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (dpifke) Can you or @Dzahn add the password to PrivateSettings.php on deploy1001? Or drop it in my home directory there so I can update it?
(I don't have access to the copy in puppet-private.) I'll verify th...
[19:22:03] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Ladsgroup)
[19:26:12] DBA: tl_from index on templatelinks is lingering in production - https://phabricator.wikimedia.org/T252126 (Ladsgroup) Resolved→Open Sorry, it is still happening on s4: `lang=json { "templatelinks tl_from index-mismatch-prod-extra": { "s4": [ "db1141.eqiad.wmnet", "db1...
[19:26:17] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Ladsgroup)
[19:39:29] DBA: tl_from index on templatelinks is lingering in production - https://phabricator.wikimedia.org/T252126 (Marostegui) Open→Resolved Done! Thanks for reporting ` root@cumin1001:/home/marostegui# for i in db1141 db1121 db1148; do echo $i; mysql.py -h$i commonswiki -e "show create table templatelinks...
[19:39:34] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[19:50:12] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Marostegui) a:Marostegui Thanks for reporting this - looks like this host slipped through the cracks with all the host moves we are doing lately. I have started the alter now there. Should be done by tomo...
[19:51:18] DBA: imagelinks has index mismtach on s8 - https://phabricator.wikimedia.org/T256680 (Ladsgroup)
[19:57:02] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Dzahn)
[19:59:47] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Marostegui) It was faster than expected as dewiki was ok and the other wikis are very small: ` # for i in `cat /home/marostegui/git/mediawiki-config/dblists/s5.dblist | grep -v "#" `; do echo $i; mysql.py -h...
[20:00:25] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Marostegui) Open→Resolved
[20:00:29] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Prevention), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[20:01:45] DBA, Operations, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (herron) >>! In T256538#6262958, @Marostegui wrote: > @herron any idea how big these DBs can be and how many writes we'd be expecting? > Which grants would be needed? > > I would assum...
[20:10:21] DBA: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 (Reedy)
[20:12:57] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Dzahn) >>! In T254795#6265191, @dpifke wrote: > Can you or @Dzahn add the password to PrivateSettings.php on deploy1001? Or drop it in my home directory there so I can update it? (I don't have access to the...
[20:15:40] DBA: page_restrictions indexes have been majestically drifting from code - https://phabricator.wikimedia.org/T256682 (Ladsgroup)
[20:17:10] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Dzahn) @dpifke The password is now in a file in your home dir on deploy1001, separate from that question above.
[20:17:23] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Dzahn) a:Marostegui→dpifke
[20:28:02] DBA: pl_from index still lingers in random hosts - https://phabricator.wikimedia.org/T256684 (Ladsgroup)
[20:32:55] DBA, Datasets-General-or-Unknown, Sustainability (Incident Prevention), WorkType-NewFunctionality: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (Ladsgroup)
[20:33:37] DBA: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 (Ladsgroup)
[20:44:56] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Ladsgroup) Can it be that it's still missing on some databases in s3 and we haven't caught it because it's 900 wikis? It would be great if you do a quick double check when you have time.
[20:47:36] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Marostegui) db1144 is in s5 (or that is what the report says) I checked all its wikis in s5 and they are now fixed.
[20:50:42] DBA: text table in db1144 drifts from core considerably - https://phabricator.wikimedia.org/T256679 (Ladsgroup) Yeah I know but since dewiki was okay but wikis like cebwiki were not and they used to be in s3, that's why I'm saying :D
[21:35:23] DBA, Data-Services, Projects-Cleanup: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (CCicalese_WMF) Is there anything that still needs to be done on this task?