[00:05:23] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[05:57:38] <jynus>	 it took almost 4 hours for lag to clear on pc2007: https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=pc2007&var-port=9104&from=1589263013743&to=1589349413743
[05:57:57] <jynus>	 that is worrying to me, does that host have performance problems?
[07:30:28] <kormat>	 jynus: i think the bottleneck is writing to disk
[07:32:22] <jynus>	 I can see: https://grafana.wikimedia.org/d/000000273/mysql?panelId=34&fullscreen&orgId=1&from=1589268735550&to=1589355135551&var-dc=codfw%20prometheus%2Fops&var-server=pc2007&var-port=9104
[07:32:39] <kormat>	 it was pegged at 80% i/o utilization for the duration of the resync
[07:32:55] <kormat>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=pc2007&var-datasource=codfw%20prometheus%2Fops&var-cluster=mysql&from=1589294267344&to=1589324114617&fullscreen&panelId=6
[07:34:44] <kormat>	 jynus: pc2010 was in a similar state, too. it took almost 3 hours to resync
[07:34:48] <jynus>	 one thing you should know is that disk utilization is a terrible metric of disk utilization and saturation
[07:34:58] <kormat>	 how so?
[07:35:06] <jynus>	 you should look at iops or queueing rather than that meaning less percentage
[07:35:13] <jynus>	 because it is not a good indicator of the io load
[07:35:37] <jynus>	 I send people here for a summary: https://www.percona.com/blog/2017/08/28/looking-disk-utilization-and-saturation/
[07:35:49] <jynus>	 it cannot be used in the same way than cpu utilization
[07:35:58] <jynus>	 but that is not important
[07:36:08] <jynus>	 the io usage is concerning
[07:36:20] <kormat>	 will read, thanks
[07:36:33] <marostegui>	 taking 3-4h for those hosts to recover is expected
[07:36:53] <jynus>	 BTW i've been trying to get it removed from the host overview dashboard for this reason :-D
[07:37:07] <jynus>	 marostegui: I am a bit surprised by the amount of io they need
[07:37:25] <jynus>	 maybe we are using too strict mysql options for the pcs?
[07:39:06] <jynus>	 actually no: innodb_flush_log_at_trx_commit = 0 / sync_binlog = 0
[07:39:08] <kormat>	 jynus: ok, i get it now. thanks for the education :)
[07:39:29] <jynus>	 side note, not important
[07:39:38] <jynus>	 it was just a^
[07:39:56] <kormat>	 jynus: i disagree - it helps me get a better understanding of things, so it's important to me at least :)
[07:40:59] <jynus>	 so back to my initial concern
[07:41:05] <kormat>	 the reason i didn't put much weight on iops is simply because i don't know the specs of the hardware. e.g. is 4k iops a lot? is it max? no idea
[07:42:27] <kormat>	 anyway. sounds like this is "working within normal parameters" (whether that's good or not is a separate issue)
[07:42:52] <jynus>	 iops has been increased steadily over the last 5 months
[07:42:58] <jynus>	 it is not reimage-related
[07:43:27] <jynus>	 maybe the "solution" would be to setup more servers in the next refresh
[07:44:16] <jynus>	 See: https://grafana.wikimedia.org/d/000000273/mysql?panelId=33&fullscreen&orgId=1&from=1578436551491&to=1589355722198&var-dc=codfw%20prometheus%2Fops&var-server=pc2007&var-port=9104
[07:45:13] <jynus>	 to me that looks like a 15% increase over 5 months
[07:45:26] <jynus>	 not much, but may be more over the last few years
[07:46:11] <jynus>	 we could also experiment with more unsafe replication options for parsercaches
[07:46:25] <jynus>	 no hard actionables for now
[07:47:13] <jynus>	 for context, I am a bit worried this reappears: https://phabricator.wikimedia.org/T247787
[07:53:18] <kormat>	 marostegui: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/596146
[07:59:28] <kormat>	 marostegui: alright, now we get to the good stuff. how do i reimage a pc master? :)
[08:00:10] <marostegui>	 kormat: will be with you in 5 minutes
[08:00:23] <kormat>	 np :)
[08:06:54] <marostegui>	 kormat: so, the idea is to update pc1007, so we need to depool it on MW, and pool pc1010 instead. pc1010 replicates from pc1007 so it will have the same keys, so drops should be noticed
[08:08:10] <jynus>	 db1087 with ongoing errors
[08:08:37] <marostegui>	 the usual:  1788219428 | wikiadmin            | 10.64.16.77:46570  | wikidatawiki       | Query       |   25704 | Copying to tmp table on disk                                          | SELECT /* SpecialFewestRevisions::reallyDoQuery
[08:08:41] <marostegui>	 can you take care of that?
[08:08:41] <jynus>	 ha
[08:08:44] <jynus>	 yeah
[08:08:46] <marostegui>	 thanks
[08:09:36] <marostegui>	 I have set this to high: https://phabricator.wikimedia.org/T238199
[08:09:49] <jynus>	 don't worry, I have it
[08:09:54] <marostegui>	 thanks
[08:10:02] <marostegui>	 I don't have enough hands this week .(
[08:14:24] <kormat>	 marostegui: i have notes on how to change the mw config for this, but i don't know how to make pc2007 a slave without screwing up replication
[08:14:44] <marostegui>	 kormat: In this case, there is no need to change replication really
[08:14:47] <jynus>	 there is more wikiadmin processes, but I think I killed the impacting one
[08:14:56] <jynus>	 we'll see if it reoccurs
[08:15:04] <marostegui>	 jynus: make sure to kill it on mwmaint1002, otherwise it will start again :(
[08:15:14] <jynus>	 mwmaint?
[08:15:20] <jynus>	 I killed on snapshot
[08:15:22] <marostegui>	 kormat: there is no need to change replication, we'll lose a few keys (the ones that will get inserted on pc1010) but that is ok
[08:15:40] <jynus>	 it was the rdf exporter
[08:15:49] <marostegui>	 jynus: root@cumin1001:~# host 10.64.16.77
[08:15:50] <marostegui>	 77.16.64.10.in-addr.arpa domain name pointer mwmaint1002.eqiad.wmnet.
[08:15:53] <kormat>	 marostegui: so the process would be: tell mw to use pc1010 as pc1 master, stop replication on pc1010, reimage pc1007, start replication again on pc1010, and tell mw to use pc1007 again?
[08:16:02] <marostegui>	 | 1788219428 | wikiadmin            | 10.64.16.77:46570  | wikidatawiki       | Query       |   25704 | Copying to tmp table on disk                                          | SELECT /* SpecialFewestRevisions::reallyDoQuery  */  page_namespace AS `namespace`,page_title AS `ti |    0.000 |
[08:16:24] <jynus>	 in any case, that is not ongoing
[08:17:12] <marostegui>	 kormat: make sure to silence pc1010 replication checks, otherwise they'll page as pc1010 replicates from pc1007 and pc1007 will be down
[08:17:14] <jynus>	 anymore
[08:17:20] <marostegui>	 jynus: cool thanks
[08:17:27] <kormat>	 marostegui: ah hah, right
[08:32:40] <kormat>	 marostegui: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/596152
[08:36:58] <kormat>	 damn you, i was just following https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/580117/ as a template :)
[08:37:16] <jynus>	 ha ha
[08:37:23] <jynus>	 do we we say, not as we do :-P
[08:37:47] <marostegui>	 Well, that was an emergency
[08:37:58] <marostegui>	 I didn't have much time for it
[08:38:32] <marostegui>	 kormat: this was the update to that, the day after https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/581328/
[08:38:41] <marostegui>	 to clarify the situation after the emergency
[08:39:46] <kormat>	 ok ok :)
[08:40:28] <kormat>	 marostegui: updated
[08:41:10] <kormat>	 oh, sorry, i missed part of it. fixing.
[08:45:13] <kormat>	 man. gerrit is still confusing as shit.
[08:45:44] <kormat>	 marostegui: PTAL
[08:45:58] <marostegui>	 I +1ed it
[08:48:01] <kormat>	 damnit, i missed that. (i refer you to gerrit being terrible :) thanks!
[08:53:20] <kormat>	 marostegui: alright, change deployed. i can confirm that 'show processlist' says that pc1007 is now idle, and pc1010 is very popular with mediawiki now
[08:53:36] <marostegui>	 cool - check what I mentioned on -operations
[08:53:40] <marostegui>	 before stopping mysql on pc1007
[08:53:52] <kormat>	 seen, on it.
[08:56:14] <kormat>	 marostegui: how does 6h downtime sound for this?
[08:56:54] <marostegui>	 sounds good to me
[08:57:24] <kormat>	 (rationale: it should be long enough if everything goes well; if it doesn't, we'll want to reconfigure replication anyway to keep pc1010 as master)
[08:58:01] <marostegui>	 6h sounds yeah, if we don't have issues during the install
[09:01:35] <marostegui>	 kormat: as expected, no changes here https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-6h&to=now as pc1010 has the same keys
[09:01:46] <marostegui>	 when we'll do the same for pc1009, there will be a drop
[09:01:59] <marostegui>	 but it is ok
[09:02:16] <marostegui>	 we could mitigate it by putting pc1010 to replicate from pc1009 a few days earlier, but probably not even worth the effort
[09:02:45] <kormat>	 a drop of hit-rate?
[09:03:49] <marostegui>	 yeah
[09:04:11] <kormat>	 we should expect a drop when we re-master pc1007, right?
[09:04:35] <marostegui>	 should be minimal, as it will just miss the keys from now
[09:04:39] <marostegui>	 so hopefully just an hour or so
[09:05:02] <kormat>	 will pc1009 be worse?
[09:05:18] <marostegui>	 once we depool pc1009 and put pc1010 - yes
[09:05:21] <kormat>	 i've run 'stop slave;' on both pc1010 and pc2007 - i assume that's all i need to do there for now
[09:05:25] <marostegui>	 as pc1010 has no keys from pc1009 as I said above
[09:05:37] <marostegui>	 kormat: and silenced those checks?
[09:05:42] <kormat>	 ohhh. now i understand. there's only 2 hosts in pc3, and only 1 in eqiad
[09:05:47] <kormat>	 yep, that's already done
[09:05:56] <marostegui>	 then you are good
[09:05:58] <kormat>	 so to keep writes in eqiad, we need to put in a spare host
[09:06:02] <marostegui>	 right
[09:06:05] <kormat>	 gotcha :)
[09:06:31] <marostegui>	 pc1 pc2 and pc3 have only 2 hosts (1 eqiad - 1 codfw)
[09:06:38] <marostegui>	 we happen to have pc1010 and pc2010 as spares
[09:06:53] <marostegui>	 and just replicate from pc1, just because we had to choose one cluster
[09:06:56] <marostegui>	 and we chose pc1
[09:07:00] <marostegui>	 but they can go anywhere
[09:07:11] <marostegui>	 as they replicate from pc1, they get the keys from pc1
[09:07:37] <marostegui>	 that's why we didn't see a drop now
[09:07:44] <marostegui>	 but we will on pc3 once pc1010 replaces pc1009
[09:07:50] * kormat nods
[09:09:13] <kormat>	 marostegui: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/596158
[09:24:33] <wikibugs>	 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['pc1007.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202005130...
[09:49:38] <wikibugs>	 10DBA, 10Upstream: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 (10Marostegui) I have installed 10.4.13 (from 10.4.12) on db2102 and this is confirmed fixed The events were not disabled. Will be interesting to also check an update from 10.1 to 10...
[09:50:59] <wikibugs>	 10DBA, 10Upstream: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489 (10Marostegui) 05Open→03Resolved a:03Marostegui This looks fixed on 10.4.13 - just tried on db2102  Running 10.4.12: ` root@db2102.codfw.wmnet[ops]> set session optimizer_switch='index_mer...
[09:51:03] <wikibugs>	 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui)
[09:52:50] <wikibugs>	 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) I am testing 10.4.13 package on db2102 - won't upload to the repo until next week or until parsercache hosts in eqiad are done, let's not experiment with maste...
[09:52:52] <wikibugs>	 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1007.eqiad.wmnet'] `  and were **ALL** successful.
[09:56:44] <kormat>	 marostegui: ok, pc1007 is back. next steps: 'start slave' on pc1010 + pc2007, revert the mediawiki-config change, remove the downtimes. sound good?
[09:56:59] <marostegui>	 pc1007 all green?
[09:57:25] <kormat>	 yep
[09:57:30] <marostegui>	 then +1 yeah
[09:58:19] <jynus>	 I am guessing you showed him already the pc grafana and dberror logstash dashboards?
[09:58:41] <marostegui>	 no the logstash dashboard yet
[09:59:03] <kormat>	 i'm not sure i understand the pc grafana dashboard either :)
[09:59:03] <marostegui>	 tendril and grafana was enough for now
[09:59:25] <marostegui>	 kormat: I guess he means the aggregated, the individual and the parsercache hit ratio I showed earlier
[09:59:29] <kormat>	 i'm not sure what "content model" is, or why only pc1 is shown
[10:00:08] <jynus>	 don't worry about that for now :-D
[10:03:12] <kormat>	 marostegui: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/596173 is a direct revert of my earlier change
[10:03:45] <kormat>	 (i assume this is routine, but as i've only done this once before, i thought i'd check)
[10:04:45] <marostegui>	 yeah, that looks good
[10:05:42] <jynus>	 nice Re:T245489
[10:05:43] <stashbot>	 T245489: Possibly disable optimizer flag: rowid_filter on 10.4 - https://phabricator.wikimedia.org/T245489
[10:06:17] <jynus>	 10.5 rc also released
[10:06:48] <jynus>	 I think we did the right decision with 10.4
[10:10:09] <marostegui>	 i think so yeah
[10:10:36] <wikibugs>	 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo)
[10:13:30] <wikibugs>	 10DBA, 10Operations, 10ops-codfw: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo) a:05jcrespo→03Papaul @Papaul please helps us out. This seems like an ordinary dimm failure, but we need to do the usual swap to discard board/processor. This happened before at...
[10:14:59] <wikibugs>	 10DBA: Re-build db2097 s1 and s6 - https://phabricator.wikimedia.org/T228613 (10jcrespo) 05Open→03Resolved a:03jcrespo This was rebuilt on second crash: T252492 (that is why it took me so much time to send it to dc ops). CC @Marostegui
[10:16:32] <jynus>	 one thing marostegui: if labsdb1011 ends up working, we may want to use backup1002 for a second copy after catch up
[10:16:42] <marostegui>	 +1
[10:16:46] <jynus>	 that way we can keep the backup1001
[10:16:56] <jynus>	 plus we don't disturb it any more
[10:16:59] <kormat>	 marostegui: btw, my replacement laptop (with the correct keyboard this time!) arrived on monday :)
[10:17:03] <marostegui>	 backup1001 will be obsolete in 2 weeks (as the binlogs will expire)
[10:17:09] <marostegui>	 kormat nice!!
[10:17:24] <jynus>	 ok, we can then delete it in 2 weeks :-D
[10:17:44] <marostegui>	 yeah, let's keep it for 2 weeks and once it has caught up, send a copy to 1001
[10:17:50] <marostegui>	 so that gives us another 30 days
[10:17:59] <marostegui>	 jynus: I assume binary copy?
[10:19:05] <jynus>	 I still thinl the upgrade, not the purge threads is to blame
[10:19:23] <marostegui>	 but the host was crashing before too
[10:19:24] <marostegui>	 with 10.1
[10:19:34] <jynus>	 sorry
[10:19:45] <jynus>	 with upgrade I mostly meant "binary format"
[10:19:51] <jynus>	 which the upgrade wouldn't like
[10:19:54] <jynus>	 either
[10:19:59] <marostegui>	 but the host had crashes with 10.1
[10:20:08] <marostegui>	 Could be the storage too
[10:20:13] <jynus>	 yeah
[10:20:14] <marostegui>	 God knows
[10:20:23] <jynus>	 but I think at this time last time it had crashed
[10:20:32] <marostegui>	 Let's make a binary copy once it is finished syncing up
[10:20:32] <jynus>	 a few hours in
[10:20:36] <jynus>	 yep
[10:20:39] <marostegui>	 yeah
[10:20:47] <marostegui>	 so far it has been replicating for more than 12h
[10:20:53] <marostegui>	 as I started it yesterday at 10pm CEST or so
[10:21:02] <jynus>	 which also means labsdb1012 may be bad too
[10:21:14] <jynus>	 do you remember were you copied it from?
[10:21:19] <marostegui>	 yes, 1012
[10:21:26] <marostegui>	 always from 1012
[10:21:28] <jynus>	 no, I mean 1012
[10:21:33] <marostegui>	 ah
[10:21:37] <marostegui>	 no :(
[10:21:39] <marostegui>	 but I can check
[10:21:41] <marostegui>	 let me see if I can find it
[10:21:42] <jynus>	 it is ok if you don't, I will try to find it
[10:22:24] <jynus>	 from 11
[10:22:29] <marostegui>	 1011
[10:22:41] <marostegui>	 https://gerrit.wikimedia.org/r/#/c/494408/
[10:22:46] <marostegui>	 wow, march 2019
[10:23:00] <jynus>	 that could explain either the source or the cause
[10:23:09] <marostegui>	 we'll see once we upgrade 1012 :)
[10:24:44] <marostegui>	 I remember when we set up labsdb hosts, it took around 5 days to import everything, and now it took like 11
[10:26:27] <kormat>	 hurm. tendril seems Unhappy. it shows zero hosts in pc1
[10:26:39] <marostegui>	 ah
[10:26:41] <marostegui>	 I know why
[10:26:47] <marostegui>	 time to explore tendril DB
[10:26:49] <marostegui>	 get ready!
[10:27:03] * kormat readies his resignation later
[10:27:05] <marostegui>	 so the reason for that is that the table masters is empty (as you dropped and created that host)
[10:27:14] <marostegui>	 connect to db1115 and tendril database
[10:27:24] <jynus>	 with -A
[10:27:28] <marostegui>	 hehe yeah
[10:27:47] <marostegui>	 do a show tables for your amusement 
[10:28:16] <kormat>	 yep, it definitely has tables. ;)
[10:28:32] <marostegui>	 check the shards table
[10:28:51] <marostegui>	 as you can see, every section has a master there
[10:29:07] <marostegui>	 and as you dropped and created pc1007, that master_id isn't valid anymore
[10:29:20] <marostegui>	 so you need to update that record with the new id, which you can find on the servers table
[10:29:29] <kormat>	 it's now `1719`, from the output of the tendril scripts
[10:29:56] <jynus>	 that works but if you don't know, you can always check
[10:30:01] <marostegui>	 it matches the servers table?
[10:30:07] <jynus>	 yep
[10:30:14] <kormat>	 `update shards set master_id=1719 where name='pc1';` ?
[10:30:22] <jynus>	 master_id is missleading it is not related to replication
[10:30:28] <marostegui>	 that sounds good kormat 
[10:30:29] <jynus>	 kormat: cool, but as a tip
[10:30:29] <kormat>	 (and yes, i checked the servers table too)
[10:30:37] <marostegui>	 (cool)
[10:30:38] <jynus>	 LIMIT 1 is your friend
[10:30:47] <jynus>	 to prevent a non-where accident
[10:31:04] <jynus>	 don't worry much, we have backups anyway
[10:31:17] <jynus>	 but I use it when I can
[10:31:25] * kormat notes
[10:31:32] <marostegui>	 kormat: pc1 looking good now!
[10:31:45] <marostegui>	 you will need to do the same with pc3 once you get to it
[10:32:04] <jynus>	 note many of those are just suggestions/tip in case it helps, no commands :-D
[10:32:10] <marostegui>	 kormat: If I were you I wouldn't get to pc3 today, let's leave pc1 running 10.4 for at least 24h just in case
[10:32:18] <kormat>	 jynus: all appreciated :)
[10:32:23] <kormat>	 marostegui: WFM :)
[10:32:46] <wikibugs>	 10DBA, 10Patch-For-Review: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 (10Kormat) pc1 is now fully upgraded.
[10:33:02] <marostegui>	 kormat: Remember I am out tomorrow and friday, so you can either wait till monday and dig with the other tasks you've got in your plate or ask jaime to review your patches. Anything works, it is not super urgent to get pc3 done this week
[10:33:19] <jynus>	 for a review I don't mind
[10:33:22] <jynus>	 doesn't take me much
[10:33:52] <kormat>	 ok, cool.
[10:34:09] <kormat>	 marostegui: if i were going to leave it till next week, would it make sense to move pc1010 to pc3 in the meantime?
[10:34:15] <jynus>	 I find interesting that hit rate drop didn't go below 15%
[10:34:41] <marostegui>	 kormat: you can, but the problem is that we'll need to clean up those entries again, and I am not sure it is worth to save the hit rate that much
[10:34:46] <jynus>	 s/below/above/
[10:35:02] <marostegui>	 kormat: I wouldn't bother to be honest
[10:35:06] <kormat>	 marostegui: ok :)
[10:35:23] <kormat>	 ah. we don't backup parsercache, do we?
[10:35:27] <marostegui>	 nop
[10:35:33] <marostegui>	 doesn't make sense
[10:35:37] <marostegui>	 as they get purged
[10:35:41] <kormat>	 right :)
[10:36:03] <marostegui>	 On the past failover due to a mistake, we came back from codfw to eqiad with eqiad fully cold (it didn't replicate for 3 weeks)
[10:36:09] <marostegui>	 we had high latency but we survived 
[10:36:12] <kormat>	 if this was a service we had backups for, we could in theory wipe pc1010, restore a snapshot of the db onto it, and then replicate to catch up. correct?
[10:36:19] <marostegui>	 kormat: yeah
[10:36:19] <jynus>	 marostegui: I mean, it wasn't ideal
[10:36:32] <marostegui>	 jynus: I am not saying it was ideal
[10:36:36] <marostegui>	 I am saying we survived
[10:36:37] <jynus>	 :-D
[10:36:48] <kormat>	 that seems like it should be the motto of this channel
[10:36:49] <jynus>	 :-P
[10:37:46] <marostegui>	 kormat: we could even truncate pc1010 entirely, but I don't think it is worth all the stuff, moving replicationg, truncating, optimizing etc
[10:39:21] <kormat>	 it would also take up to a month to have it fill up in that case, right?
[10:39:26] <marostegui>	 yes
[10:39:52] <jynus>	 kormat: actually when we setup pc* hosts for the first time on codfw, I did exactly that
[10:40:16] <jynus>	 it was the last batch of dbs to buy when codfw was setup
[10:40:22] <kormat>	 marostegui: as you're on clinic duty, what are the appropriate tags to put on a task suggesting we drop mysql in favour of sqlite? it's web-scale
[10:40:41] <marostegui>	 kormat: Probably "Performance"
[10:40:57] <jynus>	 https://bash.toolforge.org/quip/x4-fDXIBj_Bg1xd3MKvW
[10:41:18] <marostegui>	 I am still trying to find where did I say ideal
[10:41:19] <marostegui>	 Anways
[10:41:26] <jynus>	 :-D
[10:41:26] <kormat>	 jynus: thanks! i couldn't find that tool
[10:55:53] <jynus>	 marostegui: before you go for the week, could I get a review by the end of the day? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595153/
[10:56:09] <marostegui>	 I will take a look
[11:53:03] <jynus>	 thanks
[12:13:57] <jynus>	 did you know that he have done over 6000 sucessful backups since the refactoring, and that we have over 20GB of backup file data?
[12:28:27] <wikibugs>	 10DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (10Marostegui)
[13:53:38] <wikibugs>	 10DBA, 10cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (10Marostegui) s5 already sync'ed (it is always the fastest one) - the rest are progressing nicely:  `          Seconds_Behind_Master: 937736          Seconds_Behind_Master: 771949...
[14:10:58] <wikibugs>	 10DBA: Add replication lag (and other checks) to misc all hosts - https://phabricator.wikimedia.org/T237927 (10jcrespo) 05Open→03Resolved a:03jcrespo This is now fixed.
[14:11:01] <wikibugs>	 10DBA, 10observability, 10Epic, 10Patch-For-Review: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10jcrespo)
[14:42:20] <wikibugs>	 10DBA, 10observability, 10Epic, 10Patch-For-Review, 10Sustainability (Incident Prevention): Reduce false positives on database pages - https://phabricator.wikimedia.org/T177782 (10jcrespo) 05Open→03Resolved a:03jcrespo Whit this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595153/ I think...
[14:42:21] <wikibugs>	 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10jcrespo)
[14:46:06] <wikibugs>	 10DBA, 10observability, 10Performance-Team (Radar): Improve database application performance monitoring visibility - https://phabricator.wikimedia.org/T177778 (10jcrespo) 05Open→03Resolved a:03aaron Some actionables were done- graphite now has lag visible on grafana. And of course, we have MySQL promet...
[14:46:08] <wikibugs>	 10DBA, 10observability, 10Epic: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492 (10jcrespo)
[14:46:12] <wikibugs>	 10DBA, 10Operations, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo)
[14:49:30] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2136.codfw.wmnet ` The log...
[14:53:31] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) ` [edit interfaces interface-range vlan-private1-a-codfw]      member xe-2/0/3 { ... } +    member ge-1/0/0; [edit interfaces...
[14:53:32] <wikibugs>	 10DBA, 10observability, 10Epic: Move paging from individual databases to database service "groups" - https://phabricator.wikimedia.org/T252679 (10jcrespo)
[14:53:38] <wikibugs>	 10DBA, 10observability, 10Epic: Move paging from individual databases to database service "groups" - https://phabricator.wikimedia.org/T252679 (10jcrespo) p:05Triage→03Medium
[14:53:48] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[15:02:01] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[15:07:35] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[15:13:46] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2136.codfw.wmnet'] `  and were **ALL** successful.
[15:16:21] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2137.codfw.wmnet ` The log can be found in `/var...
[15:21:34] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2138.codfw.wmnet ` The log can be found in `/var...
[15:26:19] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) @Marostegui  please check is this looks good on db2136  ` Disk /dev/sda: 8.7 TiB, 9598580817920 bytes, 18747228160 sectors Disk model: PERC H730P A...
[15:27:36] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[15:33:21] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[15:40:43] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2137.codfw.wmnet'] `  and were **ALL** successful.
[15:41:07] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2139.codfw.wmnet ` The log can be found in `/var...
[15:46:03] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2138.codfw.wmnet'] `  and were **ALL** successful.
[15:47:22] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2140.codfw.wmnet ` The log can be found in `/var...
[16:04:20] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2140.codfw.wmnet'] `  Of which those **FAILED**: ` ['db2140.codfw.wmnet'] `
[16:06:34] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2139.codfw.wmnet'] `  and were **ALL** successful.
[16:09:12] <jynus>	 did you "break" 2 disks in a row or is it a monitoring glitch?
[16:11:53] <jynus>	 "Firmware state: Rebuild"
[16:11:55] <jynus>	 mmm
[16:12:33] <jynus>	 that's weird
[16:14:18] <jynus>	 both first disks are being rebuilt
[16:15:47] <jynus>	 oh, it is not you, it is papaul
[16:22:23] <jynus>	 I've pinged him on dcops channel
[16:32:27] <wikibugs>	 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10jcrespo) @Marostegui we had a replication breakage from m1 master (10.1) to db2078 (10.4). T251222#6133759  Thanks to the lastest improvements in misc monitoring (T237927)- literally closed min...
[16:40:13] <wikibugs>	 10DBA, 10MediaWiki-Page-derived-data, 10Performance-Team (Radar), 10Schema-change: Avoid MySQL's ENUM type, which makes keyset pagination difficult - https://phabricator.wikimedia.org/T119173 (10Krinkle) @jcrespo Would if we said "using ENUM is discouraged. Use plain integers with mapping elsewhere is pref...
[16:51:36] <wikibugs>	 10DBA, 10MediaWiki-Page-derived-data, 10Performance-Team (Radar), 10Schema-change: Avoid MySQL's ENUM type, which makes keyset pagination difficult - https://phabricator.wikimedia.org/T119173 (10jcrespo) Independently of the "strength", I think it could be missunderstood, the same way now many people think...
[16:53:02] <wikibugs>	 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Daimona)
[16:53:15] <wikibugs>	 10DBA, 10AbuseFilter: Find a good way to run the updateVarDumps script on large wikis - https://phabricator.wikimedia.org/T252696 (10Daimona)
[17:38:52] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) a:05jcrespo→03None
[17:44:12] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul)
[17:45:07] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-codfw: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) 05Open→03Resolved @Marostegui complete
[18:33:39] <wikibugs>	 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Cmjohnson)
[18:33:41] <wikibugs>	 10DBA, 10Data-Services, 10Operations: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Cmjohnson)
[19:03:39] <wikibugs>	 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Cmjohnson)
[19:04:44] <wikibugs>	 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Cmjohnson)
[19:08:21] <wikibugs>	 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Cmjohnson)