[02:49:28] DBA, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (sbassett) >>! In T250715#6072786, @Marostegui wrote: > #security-team advise on whether the content of this table can be public or it needs to be redacted? Thanks! Is it possible for the #security-team to g...
[05:10:17] DBA, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (Marostegui) >>! In T250715#6074072, @sbassett wrote: >>>! In T250715#6072786, @Marostegui wrote: >> #security-team advise on whether the content of this table can be public or it needs to be redacted? Thanks...
[05:22:29] DBA, Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (Marostegui)
[05:32:56] DBA, Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (Marostegui) s2 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1129 [] db1125 [] db1122 [x] db1105 [x] db1103 [x] db1095 [x] db1090 [] db1076 [] db1074
[05:43:39] DBA, Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (Marostegui)
[05:49:48] DBA, Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (Marostegui)
[06:01:29] DBA, Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (Marostegui) s6 eqiad [x] labsdb1012 [x] labsdb1011 [x] labsdb1010 [x] labsdb1009 [x] dbstore1005 [x] db1139 [x] db1131 [x] db1125 [x] db1113 [x] db1098 [x] db1096 [x] db1093 [x] db1088 [x] db...
[06:01:35] DBA, Schema-change: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 (Marostegui)
[06:24:39] DBA, User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (Marostegui) p:Triage→Medium This is what the master currently has: ` root@db1112:/srv/sqldata/mediawikiwiki# ls -lh flagged* -rw-rw---- 1 mysql mysql 1.3K Nov 13 2015 flaggedimages.frm -rw...
[06:24:43] DBA, User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (Marostegui)
[06:24:45] DBA, Epic, Tracking-Neverending: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921 (Marostegui)
[06:25:54] DBA, User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (Marostegui)
[06:32:45] DBA, User-DannyS712: Drop flagged revs tables on mediawikiwiki - https://phabricator.wikimedia.org/T248298 (Marostegui) a:Marostegui I have renamed the tables on db1075, and will drop them next week: ` root@db1075:/srv/sqldata/mediawikiwiki# ls -lh | grep flagged -rw-rw---- 1 mysql mysql 1.3K Nov 18...
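The rename-before-drop step mentioned in T248298 is a common safety net: tables are first renamed, so anything still using them fails loudly and the change is easy to revert, and only dropped after a grace period. The following is a minimal sketch of what that can look like; the exact table names and the T-prefix convention are assumptions for illustration, not the commands actually run on db1075:

```sql
-- Sketch only: rename first, so the change is reversible and anything
-- still reading the tables fails fast instead of silently.
RENAME TABLE flaggedimages TO T248298_flaggedimages,
             flaggedpages  TO T248298_flaggedpages,
             flaggedrevs   TO T248298_flaggedrevs;

-- ...after a grace period (e.g. a week) with no complaints:
DROP TABLE T248298_flaggedimages, T248298_flaggedpages, T248298_flaggedrevs;
```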
[06:36:56] DBA, OTRS: OTRS database is "too large" - https://phabricator.wikimedia.org/T138915 (Marostegui) otrs current compressed backup size: ` 417G dump.m2.2020-04-14--04-34-33/ `
[08:24:12] DBA, Dumps-Generation, MediaWiki-extensions-CodeReview: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (ArielGlenn) Just a check-in to see where this is on people's radar.
[09:08:54] hello DBAs. is it true you are already aware of the replica of commonswiki DB in cloud being lagged behind prod?
[09:09:06] someone is asking in -cloud and says it's 13 hours
[09:09:07] yes
[09:09:12] ok, thank you marostegui
[09:09:25] it usually lags behind
[09:09:27] i told them to make a ticket. not needed?
[09:09:39] I'm sure it is related to quarry?
[09:10:15] i don't know, but i think the issue is that different people have different expectations about who is "already on it"
[09:10:20] I can try to depool it and give it some rest, but not sure if it is really worth it
[09:10:42] wikireplicas usually lag
[09:11:12] gotcha!
[09:11:17] this sounded new: 09:00 < don-vip> I see only files imported at least 13 hours ago, yesterday evening it was only 8 hours, the lag increased this night
[09:11:57] yes
[09:12:03] it is overloaded
[09:12:13] we are going to try to upgrade it next week and see if it helps
[09:12:25] but yes, labsdb hosts are overloaded
[09:12:41] especially quarry's one
[09:13:04] thanks, i will forward that. i am just the messenger because people thought it's "just maintenance" or something
[09:13:50] thanks :*
[09:14:27] yw. done. thanks too
[09:37:00] I believe es* backup checks will fail
[09:37:20] and the reason is probably lack of grants
[09:39:36] also tendril is warning, and it is because the backup is quite small - I may have a fix for that later
[09:39:58] yeah, just saw that warning
[09:40:29] I am thinking of calculating the average size in the last month
[09:40:40] and if it varies ±15%, alert
[09:41:03] that can help, or maybe compare with the previous one just to start with something
[09:41:12] it is in zarcillo after all, no?
[09:41:16] yes
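The idea floated here (compare each backup's size against the recent average and alert if it deviates by more than ±15%) could be expressed roughly as below. This is only a sketch: the `backups` table name and its `section`, `end_date` and `total_size` columns are assumptions standing in for the zarcillo metadata schema, not its real definition.

```sql
-- Sketch only: table/column names are assumed, not the actual zarcillo schema.
-- Flag the latest backup of a section if its size deviates more than 15%
-- from the average of the backups taken in the previous month.
SELECT b.section,
       b.total_size,
       avg_month.avg_size,
       100 * (b.total_size - avg_month.avg_size) / avg_month.avg_size AS pct_change
FROM backups b
JOIN (
    SELECT section, AVG(total_size) AS avg_size
    FROM backups
    WHERE end_date > NOW() - INTERVAL 1 MONTH
    GROUP BY section
) avg_month USING (section)
WHERE b.end_date = (SELECT MAX(end_date) FROM backups WHERE section = b.section)
  AND ABS(b.total_size - avg_month.avg_size) / avg_month.avg_size > 0.15;
```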
[09:42:13] let me fix es* first, I may have to run a postprocess_only
[09:44:49] [00:55:24]: ERROR - We could not connect to db1115.eqiad.wmnet to store the stats
[09:45:07] "Access denied for user XXX@XXX"
[09:45:22] I don't check for those errors because I alert on icinga anyway
[09:46:41] and the rule is to "finish anyway, consider metadata reporting as non fatal"
[09:48:11] also, despite our account handling TODOs, we have a very strict host checking policy
[09:54:50] rechecking icinga...
[09:56:13] good one! :)
[10:01:24] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (Marostegui) @Ladsgroup @Addshore let's start this on Monday 27th?
[10:03:56] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (Ladsgroup) >>! In T248086#6074552, @Marostegui wrote: > @Ladsgroup @Addshore let's start this on Monday 27th? Fine by me!
[10:21:40] DBA, Dumps-Generation, MediaWiki-extensions-CodeReview, Security-Team: Publish SQL dumps of CodeReview tables - https://phabricator.wikimedia.org/T243055 (Marostegui) I guess this needs a review from #security-team to double check if they need redaction or can go as they are.
[10:31:05] It is way more readable now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1115
[10:32:31] oh yeah
[10:34:25] and yes, m2 keeps being larger than s1, s4 and s8 combined
[10:34:49] I commented on the OTRS ticket today
[10:34:56] and yeah, it is crazy
[10:35:19] it is what happens when files are stored on the db :-D
[10:35:40] :)
[10:41:18] the purge lag is increasing on labsdb1011
[10:41:38] that means it is a long running transaction, not just general query load
[10:42:17] https://grafana.wikimedia.org/d/000000273/mysql?panelId=11&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&from=1579689729556&to=1587465729556
[10:44:00] yes, there were very long running queries
[10:44:16] no, but I mean, a 1 month connection
[10:44:16] almost close to the 14400 limit
[10:44:40] not 3-hour long, but days-long
[10:45:42] 1 month connection?
[10:45:58] not literally, it could be multiple ones
[10:46:15] 1000 million pending purges is really bad
[10:46:35] host should be depooled, and restarted with a larger number of purging threads
[10:46:50] as that will only make things slower
[10:47:16] +1 if you want to go for it
[10:49:23] for example, it could be wmf-pt-kill
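To confirm a purge backlog like the one discussed here and find the transactions holding it back, one would typically look at the InnoDB history list length and the oldest open transactions. A sketch of the kind of queries involved; these are standard MariaDB/MySQL views, but the exact diagnosis steps are an assumption, not what was run on labsdb1011:

```sql
-- How far behind the purge threads are (undo log records not yet purged).
SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';

-- Oldest open transactions: a transaction open for days prevents purging.
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_modified
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 5;

-- Number of purge threads; changing it requires a restart, hence
-- "depooled, and restarted with a larger number of purging threads".
SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads';
```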
[12:31:11] I think you may like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/591326
[12:34:08] oooh nice!
[12:34:12] I have only read the commit message so far
[12:34:24] I pasted the example output as a comment
[12:34:25] But that looks promising!
[12:34:36] And a good improvement over just alerting on space
[12:34:38] Last snapshot for s3 at codfw (db2098.codfw.wmnet:3313) taken on 2020-04-20 09:03:14 is 853 GB, but previous one was 852 GB, a change of 0.2%
[12:34:56] obviously I had to hardcode warn with a 0% change :-D
[12:35:01] to get the error
[12:35:14] I really like that approach
[12:35:27] I am not sure about just looking at the last run
[12:35:40] the change from one to the other can be very small
[12:35:46] especially for daily snapshots
[12:36:33] should we have a limit for snapshots?
[12:36:39] or a minimum change %?
[12:36:39] limit?
[12:36:48] like, if the change is under X%, consider that normal
[12:36:57] I say that maybe it should be by date
[12:36:57] ie: s8 will drop as soon as we remove wb_terms
[12:37:04] and that's fine as it is a big change
[12:37:11] but small changes may be normal?
[12:37:12] the limit is configurable
[12:37:28] with --warn-size-percentage and --crit-size-percentage
[12:37:34] ah cool
[12:37:47] although it is not on puppet yet per section, that could be done later
[12:38:40] but I wanted to deploy (after a few style changes) with 1% and 10%
[12:38:46] and then tune later
[12:39:16] maybe if later we do "compared to previous backups in the last month"
[12:39:25] those should increase
[12:40:00] right now the growth after compression is around 0.2% from week to week
[12:40:27] 1% for warnings?
[12:40:39] I know it sounds low
[12:40:55] but I think it would never trigger
[12:40:59] right now
[12:42:50] it is one of those things that we should try and be ready to tune, I think
[12:42:57] yeah, I agree
[12:43:15] My only comment is... are we sure that won't create alerts during the long weekend?
[12:43:27] oh, not sure if I will deploy now
[12:43:39] but wanted to have it ready
[12:43:46] (not finished yet)
[12:44:13] aaah sorry, I understood you wanted to deploy now (per your earlier sentence)
[12:44:25] I wanted to deploy soon
[12:44:32] where soon == next week
[12:44:37] got it
[12:44:39] without a lot of tuning
[12:44:51] and tune next week 0:-D
[12:45:06] my nows are flexible :-DDDDD
[12:45:21] now as in "sooner than normal"
[12:46:41] I would also have to update docs, which I am definitely not doing today
[12:47:31] haha
[12:48:38] E501 line too long (101 > 100 characters)
[12:54:47] I am happy today, lots of small improvements
[13:11:25] DBA, Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `dbproxy1011.eqiad.wmnet` - dbproxy1011.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found...
[13:25:40] DBA, Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (Kormat)
[15:10:32] marostegui: the expiry+optimize for pc1010 has made it to pc060 (just FYI)
[15:12:57] kormat: cool, let it run
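The "expiry+optimize" mentioned for pc1010 refers to deleting expired parsercache rows table by table and then reclaiming the freed space. A rough sketch for a single table follows; the objectcache-style columns and the batch size are assumptions for illustration, not the actual maintenance script:

```sql
-- Sketch, assuming an objectcache-style schema (keyname, value, exptime).
-- Delete expired entries in batches to keep locking and replication manageable...
DELETE FROM pc060 WHERE exptime < NOW() LIMIT 10000;
-- ...repeat until no rows match, then defragment the table to reclaim space:
OPTIMIZE TABLE pc060;
```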
[15:40:39] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (Cmjohnson) I pulled all power plugs, reseated psu's, DIMM and CPU. The server will not power on, the LEDs are flashing orange and red.
[16:10:43] DBA, Operations, ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (Cmjohnson) Replaced the management switch, updated netbox
[16:10:46] DBA, Operations, ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (Cmjohnson) Open→Resolved
[16:11:28] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (Cmjohnson) @Marostegui this could be down for a bit, between HPE troubleshooting and getting a tech on-site.
[16:16:53] DBA, Operations, ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (jcrespo) Thanks, Chris, for the prompt response!
[16:20:45] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) I will be the first contact for this server (although manuel will be around if needed, ofc :-D) @Cmjohnson we are aware- that is why we migrated the service away, as it couldn't w...
[16:30:30] DBA, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (sbassett) >>! In T250715#6074166, @Marostegui wrote: > We haven't extracted the data, does your team have access to the table itself on x1? If not, I can probably take a few lines out and leave them maybe so...
[16:41:42] DBA, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (Reedy) >>! In T250715#6075817, @sbassett wrote: >>>! In T250715#6074166, @Marostegui wrote: >> We haven't extracted the data, does your team have access to the table itself on x1? If not, I can probably take...
[16:44:40] DBA, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (jcrespo) > Not sure I have access to x1 I checked and you have, as you are in the deployers list. using @mwmaint1002: ` sql wikishared -- -e "SELECT * FROM enwiki.aft_feedback LIMIT 10" ` Should give you...
[17:01:41] DBA, Security-Team: Drop (and archive?) aft_feedback - https://phabricator.wikimedia.org/T250715 (jcrespo) Regarding archiving, I think there is a misunderstanding about our capabilities and its relation with backups. Commenting here for awareness on similar tasks: There is, at the moment, no proper way...
[18:24:35] DBA, Operations, ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (ayounsi) Resolved→Open Thanks! Re-opening so we don't forget to update the cable in Netbox as well.
[18:51:50] DBA, Performance-Team, Wikimedia-Rdbms, Patch-For-Review: SHOW SLAVE STATUS as a health check should have a low timeout - https://phabricator.wikimedia.org/T129093 (Krinkle) p:Medium→Low
[19:40:57] DBA, Core Platform Team Workboards (Clinic Duty Team), mariadb-optimizer-bug: SELECT /* Title::getFirstRevision */ sometimes using page_user_timestamp index instead of page_timestamp - https://phabricator.wikimedia.org/T236376 (daniel) >>! In T236376#5863347, @Marostegui wrote: > For what is worth,...
[21:45:58] DBA, MediaWiki-User-management, Core Platform Team Workboards (Clinic Duty Team), Schema-change: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Ladsgroup) Fun fact: This index is already named `ipb_address_unique` in Postgres.
[21:46:58] DBA, MediaWiki-User-management, Core Platform Team Workboards (Clinic Duty Team), Schema-change: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Ladsgroup) Another fun fact: I absolutely hate doing schema changes in Sqlite.
[22:12:25] marostegui: The patch for index rename is up \o/ https://gerrit.wikimedia.org/r/c/mediawiki/core/+/591500 it'll remove 102 drifts
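For the production side of that index rename, the per-replica change would look something along these lines. This is only an illustrative sketch, assuming a MariaDB/MySQL version that supports RENAME INDEX; on older versions the index would instead have to be dropped and re-created with its full column list from tables.sql, and this is not necessarily the command used in production:

```sql
-- Sketch only: rename the unique index on ipblocks to match the new schema name.
ALTER TABLE ipblocks RENAME INDEX ipb_address TO ipb_address_unique;
```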