[02:49:28] <wikibugs>	 10DBA, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10sbassett) >>! In T250715#6072786, @Marostegui wrote: > #security-team advise on whether the content of this table can be public or it needs to be redacted? Thanks!  Is it possible for the #security-team to g...
[05:10:17] <wikibugs>	 10DBA, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10Marostegui) >>! In T250715#6074072, @sbassett wrote: >>>! In T250715#6072786, @Marostegui wrote: >> #security-team advise on whether the content of this table can be public or it needs to be redacted? Thanks...
[05:22:29] <wikibugs>	 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui)
[05:32:56] <wikibugs>	 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui) s2 eqiad  [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1129 [] db1125 [] db1122 [x] db1105 [x] db1103 [x] db1095 [x] db1090 [] db1076 [] db1074
[05:43:39] <wikibugs>	 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui)
[05:49:48] <wikibugs>	 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui)
[06:01:29] <wikibugs>	 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui) s6 eqiad  [x] labsdb1012 [x] labsdb1011 [x] labsdb1010 [x] labsdb1009 [x] dbstore1005 [x] db1139 [x] db1131 [x] db1125 [x] db1113 [x] db1098 [x] db1096 [x] db1093 [x] db1088 [x] db...
[06:01:35] <wikibugs>	 10DBA, 10Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (10Marostegui)
[06:24:39] <wikibugs>	 10DBA, 10User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (10Marostegui) p:05Triage→03Medium This is what the master currently has: ` root@db1112:/srv/sqldata/mediawikiwiki# ls -lh flagged* -rw-rw---- 1 mysql mysql 1.3K Nov 13  2015 flaggedimages.frm -rw...
[06:24:43] <wikibugs>	 10DBA, 10User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (10Marostegui)
[06:24:45] <wikibugs>	 10DBA, 10Epic, 10Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (10Marostegui)
[06:25:54] <wikibugs>	 10DBA, 10User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (10Marostegui)
[06:32:45] <wikibugs>	 10DBA, 10User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (10Marostegui) a:03Marostegui I have renamed the tables on db1075, and will drop them next week: ` root@db1075:/srv/sqldata/mediawikiwiki# ls -lh | grep flagged -rw-rw---- 1 mysql mysql 1.3K Nov 18...
[06:36:56] <wikibugs>	 10DBA, 10OTRS: OTRS database is "too large" - https://phabricator.wikimedia.org/T138915 (10Marostegui) otrs current compressed backup size: ` 417G dump.m2.2020-04-14--04-34-33/ `
[08:24:12] <wikibugs>	 10DBA, 10Dumps-Generation, 10MediaWiki-extensions-CodeReview: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (10ArielGlenn) Just a check-in to see where this is on people's radar.
[09:08:54] <mutante>	 hello DBAs. is it true you are already aware of the replica of commonswiki DB in cloud being lagged behind prod?
[09:09:06] <mutante>	 someone is asking in -cloud and says it's 13 hours 
[09:09:07] <marostegui>	 yes
[09:09:12] <mutante>	 ok, thank you marostegui 
[09:09:25] <marostegui>	 it usually lags behind
[09:09:27] <mutante>	 i told them to make a ticket. not needed ?
[09:09:39] <marostegui>	 I'm sure it is related to quarry?
[09:10:15] <mutante>	 i don't know, but i think the issue is that different people have different expectations who is "already on it"
[09:10:20] <marostegui>	 I can try to depool it and give it some rest, but not sure if it is really worth it
[09:10:42] <marostegui>	 wikireplicas usually lag
[09:11:12] <mutante>	 gotcha!
[09:11:17] <mutante>	 this sounded like new:  09:00 < don-vip> I see only files imported at least 13 hours ago, yesterday evening it was only 8 hours, the lag increased this night
[09:11:57] <marostegui>	 yes
[09:12:03] <marostegui>	 it is overloaded
[09:12:13] <marostegui>	 we are going to try to upgrade it next week and see if it helps
[09:12:25] <marostegui>	 but yes, labsdb hosts are overloaded
[09:12:41] <marostegui>	 specially quarry's one
[09:13:04] <mutante>	 thanks, i will forward that. i am just the messenger because people thought it's "just maintenance" or something
[09:13:50] <marostegui>	 thanks :*
[09:14:27] <mutante>	 yw. done. thanks too
[09:37:00] <jynus>	 I belive es* backups checks will fail
[09:37:20] <jynus>	 and the reasoning is probably lack of grants
[09:39:36] <jynus>	 also tendril is warning, and it is because the backup is quite small- I may have a fix for that later
[09:39:58] <marostegui>	 yeah, just saw that warning
[09:40:29] <jynus>	 I am thinking of calculating the average size in the last month
[09:40:40] <jynus>	 and if it varies +-15%, alert
[09:41:03] <marostegui>	 that can help, or maybe compare with the previous one just to start with something
[09:41:12] <marostegui>	 it is in zarcillo after all no?
[09:41:16] <jynus>	 yes
[09:42:13] <jynus>	 let me fix es* first, I may have to run a postprocess_only
[09:44:49] <jynus>	 [00:55:24]: ERROR - We could not connect to db1115.eqiad.wmnet to store the stats
[09:45:07] <jynus>	 "Access denied for user XXX@XXX"
[09:45:22] <jynus>	 I don't check for those errors because I alert on icinga anyway
[09:46:41] <jynus>	 and the rule is to "finish anyway, consider metadata reporting as non fatal"
[09:48:11] <jynus>	 also, despite our account handling TODOs, we have a very strict host checking policy
[09:54:50] <jynus>	 rechecking icinga...
[09:56:13] <marostegui>	 good one! :)
[10:01:24] <wikibugs>	 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Marostegui) @Ladsgroup @Addshore let's start this on Monday 27th?
[10:03:56] <wikibugs>	 10DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (10Ladsgroup) >>! In T248086#6074552, @Marostegui wrote: > @Ladsgroup @Addshore let's start this on Monday 27th?  Fine by me!
[10:21:40] <wikibugs>	 10DBA, 10Dumps-Generation, 10MediaWiki-extensions-CodeReview, 10Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (10Marostegui) I guess this needs a review from #security-team to double check if it needs redaction or it can go as they are.
[10:31:05] <jynus>	 It is way more readable now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1115
[10:32:31] <marostegui>	 oh yeah
[10:34:25] <jynus>	 and yes, m2 keeps being larger than s1, s4 and s8 combined
[10:34:49] <marostegui>	 I commented on the OTRS ticket today
[10:34:56] <marostegui>	 and yeah, it is crazy
[10:35:19] <jynus>	 it is what happens when files are stored on db :-D
[10:35:40] <marostegui>	 :)
[10:41:18] <jynus>	 the purge lag is increasing labsdb1011
[10:41:38] <jynus>	 that means it is a long running transaction, not just general query load
[10:42:17] <jynus>	 https://grafana.wikimedia.org/d/000000273/mysql?panelId=11&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&from=1579689729556&to=1587465729556
[10:44:00] <marostegui>	 yes, there were very long running queries
[10:44:16] <jynus>	 no, but I mean, a 1 month connection
[10:44:16] <marostegui>	 almost close to the 14400 limit
[10:44:40] <jynus>	 now 3-hour long, but days-long
[10:45:42] <marostegui>	 1 month connection?
[10:45:58] <jynus>	 not literally, it could be multiple ones
[10:46:15] <jynus>	 1000 million pending purges is really bad
[10:46:35] <jynus>	 host should be depooled, and restarted with a larger number of purging threads
[10:46:50] <jynus>	 as that will only make things slower
[10:47:16] <marostegui>	 +1 if you want to go for it
[10:49:23] <jynus>	 for example, it could be wmf-pt-kill
[12:31:11] <jynus>	 I think you may like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/591326
[12:34:08] <marostegui>	 oooh nice! 
[12:34:12] <marostegui>	 I have only read the commit message yet
[12:34:24] <jynus>	 I pasted the example output as a comment
[12:34:25] <marostegui>	 But that looks promising!
[12:34:36] <marostegui>	 And a good improvement over just alerting on space
[12:34:38] <jynus>	 Last snapshot for s3 at codfw (db2098.codfw.wmnet:3313) taken on 2020-04-20 09:03:14 is 853 GB, but previous one was 852 GB, a change of 0.2%
[12:34:56] <jynus>	 obviously I had to hardcode warn with a 0% change :-D
[12:35:01] <jynus>	 to get the error
[12:35:14] <marostegui>	 I really like that approach
[12:35:27] <jynus>	 I am not sure about just looking at the last run
[12:35:40] <jynus>	 the change from one to the other can be very small
[12:35:46] <jynus>	 specially for daily snapshots
[12:36:33] <marostegui>	 should we have a limit for snapshots?
[12:36:39] <marostegui>	 or a minimum of changing %?
[12:36:39] <jynus>	 limit?
[12:36:48] <marostegui>	 like, if the change is under X% consider that normal
[12:36:57] <jynus>	 I say that maybe it should be by date
[12:36:57] <marostegui>	 ie: s8 will drop as soon as we remove wb_terms
[12:37:04] <marostegui>	 and that's fine as it is a big change
[12:37:11] <marostegui>	 but small changes maybe normal?
[12:37:12] <jynus>	 the limit is configurable
[12:37:28] <jynus>	 with --warn-size-percentage and --crit-size-percentage
[12:37:34] <marostegui>	 ah cool
[12:37:47] <jynus>	 although it is not on puppet yet per section, that could be done later
[12:38:40] <jynus>	 but I wanted to deploy (after a few style changes) with 1% and 10%
[12:38:46] <jynus>	 and then tune later
[12:39:16] <jynus>	 maybe if later we do "compared to previous backups in the last month"
[12:39:25] <jynus>	 those should increase
[12:40:00] <jynus>	 right now the growth after compression is around 0.2% from week to week
[12:40:27] <marostegui>	 1% for warnings?
[12:40:39] <jynus>	 I know it sounds low
[12:40:55] <jynus>	 but I think it would never trigger 
[12:40:59] <jynus>	 right now
[12:42:50] <jynus>	 it is one of those things that we should try and be ready to tune I think
[12:42:57] <marostegui>	 yeah, I agree
[12:43:15] <marostegui>	 My only comment is...are we sure that won't create alerts during the long weekend?
[12:43:27] <jynus>	 oh, not sure if I will deploy now
[12:43:39] <jynus>	 but wanted to have it ready
[12:43:46] <jynus>	 (not finished yet)
[12:44:13] <marostegui>	 aaah sorry I understood you wanted to deploy now (per your earlier sentence)
[12:44:25] <jynus>	 I wanted to deploy soon
[12:44:32] <jynus>	 where soon == next week
[12:44:37] <marostegui>	 got it
[12:44:39] <jynus>	 without a lot of tuning
[12:44:51] <jynus>	 and tune next week 0:-D
[12:45:06] <jynus>	 my nows are flexible :-DDDDD
[12:45:21] <jynus>	 now as in "sooner than normal"
[12:46:41] <jynus>	 I would also have to update docs, which I am definitely not doing today
[12:47:31] <marostegui>	 haha
[12:48:38] <jynus>	 E501 line too long (101 > 100 characters)
[12:54:47] <jynus>	 I am happy today, lots of small improvements
[13:11:25] <wikibugs>	 10DBA, 10Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `dbproxy1011.eqiad.wmnet` -  dbproxy1011.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   - Found...
[13:25:40] <wikibugs>	 10DBA, 10Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (10Kormat)
[15:10:32] <kormat>	 marostegui: the expiry+optimize for pc1010 has made it to pc060 (just FYI)
[15:12:57] <marostegui>	 kormat: cool, let it run
[15:40:39] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Cmjohnson) I pulled all power plugs, reseated psu's, DIMM and CPU. The server will not power on, the LEDs are flashing orange and red.
[16:10:43] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10Cmjohnson) Replaced the management switch, updated netbox
[16:10:46] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10Cmjohnson) 05Open→03Resolved
[16:11:28] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Cmjohnson) @Marostegui this could be down for a bit, between HPE troubleshooting and getting a tech on-site.
[16:16:53] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10jcrespo) Thanks, Chris, for the prompt response!
[16:20:45] <wikibugs>	 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10jcrespo) I will be the first contact for this server (although manuel will be around if needed, ofc :-D) @Cmjohnson we are aware- that is why migrated the service away, as it couldn't w...
[16:30:30] <wikibugs>	 10DBA, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10sbassett) >>! In T250715#6074166, @Marostegui wrote: > We haven't extracted the data, does your team have access to the table itself on x1? If not, I can probably take a few lines out and leave them maybe so...
[16:41:42] <wikibugs>	 10DBA, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10Reedy) >>! In T250715#6075817, @sbassett wrote: >>>! In T250715#6074166, @Marostegui wrote: >> We haven't extracted the data, does your team have access to the table itself on x1? If not, I can probably take...
[16:44:40] <wikibugs>	 10DBA, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10jcrespo) > Not sure I have access to x1  I checked and you have, as you are in the deployers list. using @ mwmaint1002:  ` sql wikishared -- -e "SELECT * FROM enwiki.aft_feedback LIMIT 10" `  Should give you...
[17:01:41] <wikibugs>	 10DBA, 10Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (10jcrespo) Regarding archiving, I think there is a misunderstanding about our capabilities and it relation with backups. Commenting here for awareness on similar tasks:  There is, at the moment, no proper way...
[18:24:35] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10ayounsi) 05Resolved→03Open Thanks! Re-opening so we don't forget to update  the cable in Netbox as well.
[18:51:50] <wikibugs>	 10DBA, 10Performance-Team, 10Wikimedia-Rdbms, 10Patch-For-Review: SHOW SLAVE STATUS as a health check should have a low timeout - https://phabricator.wikimedia.org/T129093 (10Krinkle) p:05Medium→03Low
[19:40:57] <wikibugs>	 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10mariadb-optimizer-bug: SELECT /* Title::getFirstRevision  */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (10daniel) >>! In T236376#5863347, @Marostegui wrote: > For what is worth,...
[21:45:58] <wikibugs>	 10DBA, 10MediaWiki-User-management, 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (10Ladsgroup) Fun fact: This index is already named `ipb_address_unique` in Postgres.
[21:46:58] <wikibugs>	 10DBA, 10MediaWiki-User-management, 10Core Platform Team Workboards (Clinic Duty Team), 10Schema-change: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (10Ladsgroup) Another fun fact: I absolutely hate doing schema changes in Sqlite.
[22:12:25] <Amir1>	 marostegui: The patch for index rename is up \o/ https://gerrit.wikimedia.org/r/c/mediawiki/core/+/591500 it'll remove 102 drifts