[00:03:13] 10DBA, 10Data-Services: testwiki_p.page no longer publicly viewable - https://phabricator.wikimedia.org/T264823 (10bd808) The view is there, but it seems like the rights are messed up? ` (u3518@testwiki.analytics.db.svc.eqiad.wmflabs) [testwiki_p]> show create view page\G *************************** 1. row ***...
[00:13:10] 10DBA, 10Data-Services: testwiki_p.page no longer publicly viewable - https://phabricator.wikimedia.org/T264823 (10DannyS712) Hmm, not sure if this is related, but the `page_ext_reviewed`, `page_ext_stable`, and `page_ext_quality` were part of a flagged revs hack, and should no longer be used, but appear to ne...
[03:02:26] PROBLEM - MariaDB sustained replica lag on db1081 is CRITICAL: 3.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[03:05:48] RECOVERY - MariaDB sustained replica lag on db1081 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1081&var-port=9104
[04:46:54] 10DBA, 10Data-Services: testwiki_p.page no longer publicly viewable - https://phabricator.wikimedia.org/T264823 (10Marostegui) 05Open→03Resolved a:03Marostegui This is related to: T260111 Some columns were removed as they are not part of `tables.sql`. As this probably affected more wikis than just `testw...
[05:03:30] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Marostegui) Adding @Addshore and @Ladsgroup as they have lots of cool dashboards (that I am unable to find) where they mayb...
[07:27:43] 10DBA, 10Operations, 10ops-codfw, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) The host crashed again yesterday while loading a backup, same CPU error as always ` --------------------------------------------------------------------...
[07:28:00] 10DBA, 10Data-Persistence, 10decommission-hardware, 10Patch-For-Review: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700 (10Marostegui)
[07:50:00] 10DBA, 10Data-Persistence, 10decommission-hardware, 10Patch-For-Review: decommission es2015.codfw.wmnet - https://phabricator.wikimedia.org/T264700 (10Marostegui)
[08:00:36] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui) >>! In T260111#6522279, @Ladsgroup wrote: > Non-abstract ones (half of the tables) are clean. Running on abstract tables now. \o/
[08:07:27] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Marostegui) I forgot to paste: ` root@mwmaint2001:~# crontab -l -uwww-data | grep -w wikidatawiki */3 * * * * echo "$$: Sta...
[08:16:59] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Marostegui) The only common query I have found that executes on all hosts is: ` SELECT /* Wikibase\Lib\Store\Sql\Terms\Data...
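For reference, a minimal sketch of how a broken wiki-replica view like the one above can be checked against its base table. The database, view and the removed page_ext_* columns come from the task discussion; the checks themselves are generic MySQL/MariaDB, not the actual WMF replica tooling.

```sql
-- Sketch: diagnose a replica view that references dropped columns.
-- 'testwiki' / 'testwiki_p' are taken from the conversation above.

-- What the view definition still expects:
SHOW CREATE VIEW testwiki_p.page\G

-- What the underlying table actually provides now:
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'testwiki'
  AND table_name   = 'page'
ORDER BY ordinal_position;

-- Any column the view selects but the base table no longer has (e.g. the removed
-- FlaggedRevs-era page_ext_* columns) makes the view unusable for readers.
```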
[09:11:40] 10DBA, 10Data-Persistence, 10Operations: db1076 crashed - BBU failure - https://phabricator.wikimedia.org/T264755 (10Marostegui) 05Open→03Resolved a:03Marostegui The table comparison came back clean. I have repooled the host. This host is a candidate master for s2, so it runs stretch and 10.1. It will...
[09:11:42] 10DBA, 10Operations: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui)
[10:58:12] 10DBA: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (10Marostegui) Changed it on pc2009 too
[10:58:29] 10DBA: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443 (10Marostegui)
[11:26:12] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Ladsgroup) For abstract tables: `lang=json { "change_tag change_tag_log_tag_nonuniq index-mismatch-prod-extra": { "s3": [ "db1123:advisorswiki", "db1078:advisorswiki",...
[11:26:42] marostegui: Sorry
[11:26:45] mwhahaha
[11:26:53] ooooh comeeee ooon
[11:27:13] those weren't reported earlier!
[11:29:06] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[11:31:30] I assume it was failed connection or stuff like that
[11:31:47] I should actually work on it to reuse the connection
[11:32:15] I hate the ones that involve primary keys
[11:33:14] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[11:33:57] What can possibly go wrong?
[11:35:13] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[11:37:48] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) One of the places to look that should track all access to this term related storage is https://grafana.wikimedia....
[11:38:02] marostegui: interesting s8 spikes
[11:38:11] yeah
[11:38:21] That's how far I have reached
[11:39:22] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[11:39:25] As is always the fun with wikidata, it could literally be anything, on any site :P
[11:39:36] haha
[11:39:42] marostegui: addshore actually can it be because of the index removals of wb_changes? the change dispatcher uses wb_changes quite heavily
[11:39:58] (if timewise they match)
[11:40:33] Let me see the timing
[11:40:43] hmm, any queries that happen on that table should happen continually though, not in a spikey pattern
[11:41:27] 2020-10-02 22:06:48 is the first time it appears (per the task)
[11:41:31] yeah. maybe a bot is editing too much
[11:41:41] and all the indexes were removed on codfw 1st Oct at around 07:00 AM UTC
[11:41:47] But it was deployed before on more hosts
[11:41:56] are the spikes still happening now?
[11:42:37] addshore: from the task it apparently showed up 2020-10-02 22:06:48 and then again yesterday at 2020-10-06 22:03:11
[11:42:51] However it looks like it happens daily
[11:42:55] at 22:00 UTC
[11:43:38] hmmm, wait, does this line up with the train at all?
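A rough sketch of the per-host check behind the index-mismatch drift report quoted above. The wiki, table and index names (advisorswiki, change_tag, change_tag_log_tag_nonuniq) are taken from that report; the query itself is generic, and its output would be diffed against the expected definition in MediaWiki's tables.sql on each host listed there.

```sql
-- Sketch: list how an index is actually defined on one replica so it can be
-- compared with the expected definition; run the same query per affected host.
SELECT index_name, seq_in_index, column_name, non_unique
FROM information_schema.statistics
WHERE table_schema = 'advisorswiki'
  AND table_name   = 'change_tag'
  AND index_name   = 'change_tag_log_tag_nonuniq'
ORDER BY seq_in_index;
```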
[11:43:42] * addshore looks for a gerrit change
[11:44:05] https://phabricator.wikimedia.org/T264821
[11:44:15] sorry: https://phabricator.wikimedia.org/F32376217
[11:44:22] so it looks daily
[11:44:25] we didn't have train this week
[11:44:30] things went kaboom
[11:44:35] I thought it could be https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/630188 but that is not merged yet (and also no train)
[11:44:41] https://grafana.wikimedia.org/d/000000170/wikidata-edits?viewPanel=9&orgId=1&refresh=1m
[11:45:21] Amir1: regarding the indexes, they were reported as unused by sys.unused
[11:45:48] marostegui: yeah, it's unlikely, I'm just probing every possibility
[11:45:51] Which indexes are we talking about?
[11:46:05] addshore: https://phabricator.wikimedia.org/T264109#6507662
[11:46:05] https://phabricator.wikimedia.org/T264084
[11:46:31] it was finished on the 1st Oct: https://phabricator.wikimedia.org/T264109#6507662
[11:47:17] addshore: they were reported as unused by mysql itself: https://phabricator.wikimedia.org/T262856#6490234
[11:47:37] I could certainly get a host out, add them and see if that host gets the spike today
[11:47:59] addshore: yeah, basically the table had an index per column, Lucas and I checked the whole code and couldn't find anything that uses these three
[11:48:09] two of them were actually used
[11:48:58] (total five)
[11:49:18] let me dig a bit into this
[11:49:44] what puzzles me is that it always happens around the same time
[11:49:58] Amir1: if needed, I can re-add them back to db2081 for instance, it is not a big table, shouldn't take long
[11:50:05] and then we can see if that host gets the spikes too
[11:50:33] does it happen on all hosts?
[11:50:38] yep
[11:50:53] not on the master of course
[11:50:54] That's reallllly weird
[11:51:30] Amir1: so it is a query that arrives to all hosts, that's something I checked too, to try to narrow this to a group of hosts
[11:51:47] https://phabricator.wikimedia.org/T264821#6524209
[11:52:04] that's what I have seen so far
[11:53:08] wbxl_language = 'lez'
[11:53:10] Wat
[11:53:48] I need half a litre of coffee, then I continue
[11:55:10] Amir1: also happens with language = 'ja'
[11:55:47] So far that's what I found as reported queries that make Handler_read_key and read_next handlers increase to a value that kinda matches the graph values
[11:55:55] and that gets run on all hosts and not only on certain groups
[11:57:24] yeah, so all of these term queries still do currently run on all hosts (not grouped), we will hopefully have that solved in one way or another by the end of the year
[11:58:58] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[11:59:11] cool, at least we can narrow it to that group of queries I guess, as it looks like the spikes do show on all hosts
[12:02:36] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[12:03:25] https://usercontent.irccloud-cdn.com/file/CLeplMaa/image.png
[12:03:36] wow, okay, so this is the resulting impact of these s8 spikes?
[12:05:45] yep
[12:06:17] From what chris said at https://phabricator.wikimedia.org/T264821 there were even some alerts on the -operations channel
[12:06:28] https://usercontent.irccloud-cdn.com/file/1wZRFYtD/image.png
[12:07:28] I may have a ticket relating to this *digs*
[12:07:33] see, those are the dashboards I was referring to!
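For the "reported as unused by mysql itself" check and the idea of re-adding one of the dropped wb_changes indexes on a single replica (db2081), a rough sketch, assuming the sys schema is installed on the host. The index name and column below are placeholders for illustration only; they are not the actual index definitions removed in T264109.

```sql
-- Which indexes the server considers unused (requires the sys schema):
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes
WHERE object_schema = 'wikidatawiki'
  AND object_name   = 'wb_changes';

-- Re-adding one index on a single depooled replica to see whether that host
-- stops showing the 22:00 UTC spike. Index name and column are placeholders.
ALTER TABLE wikidatawiki.wb_changes
  ADD INDEX wb_changes_test_readd (change_time);
```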
[12:08:28] So that is https://grafana.wikimedia.org/d/8HcRs-wZz/terms-related-wikidata-lua-function?orgId=1&from=1602020402068&to=1602023755950
[12:08:39] and could have something to do with it
[12:08:55] * marostegui bookmarks it
[12:09:44] but it doesn't look like that results in more sql queries on the other dashboard https://grafana.wikimedia.org/d/000000548/wikibase-sql-term-storage?orgId=1&refresh=30s&from=1602020551677&to=1602023104310
[12:10:02] it looks like it results in a bunch of cache hits
[12:10:58] now, the revid is needed for the cache key, but that is also cached, and also we see spikes of larger cache hits too https://grafana.wikimedia.org/d/000000548/wikibase-sql-term-storage?orgId=1&from=1601979910134&to=1601986948479
[12:11:42] did something change on all that logic lately?
[12:11:51] nope
[12:12:37] Amir1: we did merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/630173 recently, but that should just be moving stuff around, not changing logic
[12:13:10] and it is not deployed...
[12:13:21] * addshore looks at the other timestamp
[12:15:07] * addshore clutches straws, casts them into the wind and is no closer to a useful answer
[12:15:47] addshore: does the reported query make any sense with all those spikes on "your" dashboards?
[12:17:07] Yes the reported query could be one of the many queries that are in this spike of access, but the fact that this spike of access didn't result in an increase of db queries to these tables makes me feel it is still unrelated
[12:17:30] and looking at the spike from last week, I don't see a correlated spike via this access route
[12:18:34] my gut says that this query could be a red herring, and that anything else at all that causes s8 to have more load could make these queries run slower and thus appear as issues. So we could still be looking for some other cron, cache or process that does something else on s8 at that time, that then causes the general db load to be exposed via these other queries
[12:19:11] addshore: I grepped for wikidatawiki on mwmaint host, but of course it could be any other cron touching wikidata and not using it on the command line
[12:19:33] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) Flagging {T263999} up here as it could be related (not a currently proven link)
[12:23:09] I also notice this? https://grafana.wikimedia.org/d/000000202/api-frontend-summary?orgId=1&from=1602020402068&to=1602023755950 REST API backend internal requests spike?
[12:23:34] also happened a week ago? https://grafana.wikimedia.org/d/000000202/api-frontend-summary?orgId=1&from=1601674802000&to=1601678155000
[12:23:51] zooming out I see even more spikes https://grafana.wikimedia.org/d/000000202/api-frontend-summary?orgId=1&from=now-7d&to=now
[12:24:04] I have 0 idea what this is though or thus if it may or may not be related
[12:24:27] addshore: looks like it has been happening since 1st oct yeah
[12:24:30] per the graphs reported
[12:24:33] but that is a regularly occurring spike at the right time
[12:24:48] I'll write that and those links in the ticket?
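To make the earlier Handler_read_key / Handler_read_next observation reproducible, here is a sketch of two generic checks one could run on a replica around the 22:00 UTC window. Nothing here is WMF-specific: the second query assumes performance_schema is enabled on the host, and the wbt_% filter assumes the usual Wikibase normalized term-store table prefix.

```sql
-- Sample the handler counters shortly before and after 22:00 UTC and diff them;
-- a jump in read_key/read_next is what the graph spikes correspond to.
SHOW GLOBAL STATUS LIKE 'Handler_read_key';
SHOW GLOBAL STATUS LIKE 'Handler_read_next';

-- If performance_schema is enabled, rank statement digests touching the term
-- tables by rows examined to spot the query shape behind the spike.
SELECT digest_text, count_star, sum_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE digest_text LIKE '%wbt_%'
ORDER BY sum_rows_examined DESC
LIMIT 10;
```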
[12:25:32] addshore: yeah, daily at 22 UTC, which is weeeeeeird
[12:26:22] yeah, also a correlation with RED dashboard issues on the 3rd https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1601760432456&to=1601764464836
[12:28:11] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) ` 1:23 PM I also notice this? https://grafana.wikimedia.org/d/000000202/api-frontend-summary?orgId=1&f...
[12:28:37] marostegui: any idea who to poke further to figure out what internally is making these calls?
[12:31:33] addshore: Not sure if cdanis or _joe_ can help you there
[12:32:03] do we believe the reads are coming from mediawiki?
[12:32:07] or we still don't know
[12:32:41] oh wow the REST API graph
[12:32:43] okay
[12:33:00] so it's just as likely it's someone *else's* hourly cronjob, sigh :)
[12:33:09] ack
[12:33:12] *back
[12:33:46] <_joe_> so someone's calling restbase a lot?
[12:34:57] <_joe_> we might look at the 5xx data to see if there is some common pattern
[12:35:05] <_joe_> I start having a suspect
[12:36:09] <_joe_> maybe using the envoy telemetry dashboard we might discover where those calls are coming from
[12:37:14] let me dig into turnilo
[12:37:18] !
[12:37:30] https://w.wiki/fZj
[12:37:32] it's the mobile apps!
[12:37:52] it is very, very clearly the iOS app
[12:38:40] iOS? Its usage is like 0.0005%
[12:39:12] cdanis: wow!
[12:39:16] I'm just looking at temporal correlations in turnilo
[12:39:23] yeah
[12:39:25] its traffic increases by 20x at the top of the hour
[12:39:26] worth looking deeper
[12:39:36] cdanis: but every day at 22:00 mobile apps?
[12:39:46] <_joe_> https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=4&orgId=1&from=now-24h&to=now&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=mobileapps&var-destination=restbase-for-services
[12:40:21] <_joe_> it's wikifeeds
[12:40:22] <_joe_> https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=4&orgId=1&from=now-24h&to=now&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-destination=restbase-for-services
[12:40:23] so it didn't happen on Oct 6
[12:40:43] there is the same pattern from the mobile app traffic on Oct 5 at 22:00
[12:40:58] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=4&orgId=1&from=now-24h&to=now&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-destination=restbase-for-services
[12:41:02] Legend
[12:41:10] Amir1: it did: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&from=1602018059550&to=1602026851129&var-server=db2081&var-port=9104
[12:41:11] and Oct 4
[12:41:13] or at least on the slaves it did
[12:41:54] <_joe_> so, it's wikifeeds making a ton of requests to restbase
[12:41:57] cdanis: oh, that's interesting
[12:42:14] <_joe_> now we need to understand with petr if he sees a reason for that
[12:42:29] <_joe_> cdanis: the traffic is coming from the public I gather
[12:43:01] I wish I understood our various services and traffic flows better
[12:43:13] but yes I see the wikifeeds spike in the envoy data at 22:00 on the 6th and 5th Oct
[12:43:29] less dramatic on 4th and 3rd Oct but still there
[12:43:42] I'll get out of the way, let me know if you need a baseball bat
[12:43:58] ahah <3
[12:44:26] <_joe_> eheh
[12:44:36] <_joe_> so the flow of traffic is
[12:44:39] I'm intrigued to know what this ends up boiling down to :D
[12:45:03] <_joe_> app -> varnish -> restbase -> wikifeeds -> (various backends, including restbase itself)
[12:45:06] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Ladsgroup) An update from IRC, it seems the culprit is wikifeeds: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry...
[12:45:15] <_joe_> restbase does not cache anything for wikifeeds
[12:45:27] so what is wikifeeds doing?
[12:45:33] ah okay so the mobile apps do hit wikifeeds?
[12:45:36] (I don't even have a clear idea of what that service is)
[12:45:36] <_joe_> addshore: aggregating data
[12:45:51] <_joe_> cdanis: they call rest/api_v1/feeds/...
[12:45:56] right
[12:46:01] <_joe_> that is rb, mapped to wikifeeds
[12:46:02] which data? :) article renderings perhaps?
[12:47:00] <_joe_> so
[12:47:08] <_joe_> rb -> wikifeeds https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=4&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=wikifeeds&from=1602016916904&to=1602029921913
[12:47:14] https://w.wiki/fZn
[12:47:24] <_joe_> wikifeeds -> restbase https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=4&orgId=1&from=now-24h&to=now&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-destination=restbase-for-services
[12:47:36] at the top of the hour they become interested in /api/rest_v1/feed/featured/2020/10/xx, and also random pages
[12:47:44] and 'onthisday'
[12:48:03] onthisday is also a wikifeeds url
[12:48:33] addshore: wikifeeds is basically "today's featured article", "news", etc. if you open the main page of the app
[12:48:35] IIRC
[12:50:05] <_joe_> yes
[12:50:20] <_joe_> cdanis: I guess it's the mobile apps all fetching the new feed as they hit midnight
[12:50:27] that's horrible though :)
[12:50:40] sounds like a great idea to me *sarcasm sign*
[12:50:43] there's an obvious fix though
[12:50:53] ok I am happy to re-write the bug to point to the iOS mobile app, unless anyone else is interested
[12:50:53] get all the users to distribute uniformly across the globe
[12:50:58] how on earth have we not cached the bleep out of this
[12:51:06] Amir1: *that* is also an excellent question tbh
[12:51:30] <_joe_> ok so
[12:51:32] I still have no idea how this causes s8 to be impacted
[12:51:36] I think really it's two bugs
[12:51:45] <_joe_> I think it is cached for N minutes at the edge
[12:51:50] <_joe_> it should be
[12:52:02] looking at turnilo it *does* get cached
[12:52:04] so wtf
[12:52:07] <_joe_> cdanis: I guess it's the mobile apps all fetching the new feed as they hit midnight -> 🤦
[12:52:08] <_joe_> but first we might ask the mobile apps devs if my hypothesis is correct
[12:52:11] for more excellent questions, don't forget to subscribe and hit the button to be notified
[12:52:15] yeah I'm less sure now
[12:52:16] XDD
[12:52:34] <_joe_> cdanis: of what?
[12:52:49] there *is* a traffic spike from mobileapps towards wikifeeds, but, it also gets cached
[12:53:20] <_joe_> so why does it still make so many calls?
[12:53:28] that is also an excellent question!
[12:53:34] <_joe_> I guess you can multiply the number of cache nodes by the number of different wikis
[12:53:36] I think I have exceeded what I can do with turnilo here
[12:53:55] <_joe_> and I'm not sure how uniform the urls are
[12:53:57] but still, if this was smeared by a few more minutes in the app itself, I think it would help
[12:54:03] <_joe_> if every user can craft theirs basically
[12:54:07] <_joe_> yes
[12:54:15] <_joe_> the issue there is
[12:54:20] <_joe_> you need people to update
[12:54:33] it's iOS so I assume most people will get the autoupdate from the app store
[12:54:37] <_joe_> but yes, we need to contact the mobile app team
[12:54:41] <_joe_> it's only iOS?
[12:54:43] <_joe_> good
[12:54:46] it is only iOS, AFAICT
[12:55:04] <_joe_> ok let's open a task?
[12:55:19] yes, I will do so
[12:55:26] <_joe_> or update the current one and add the necessary tags
[12:55:34] size of traffic from iOS is tiny
[12:55:37] <_joe_> maybe it's good to make a subtask already?
[12:55:47] Amir1: but it's not tiny in the moment
[12:56:05] that's weird
[12:56:21] <_joe_> Amir1: imagine a world where all iPhones in Europe download the daily feed at the same time
[12:56:27] Amir1: according to turnilo, in aggregate the iOS app makes over (886+281+115+33+25)*128 requests in *the one minute* of 6 Oct 22:00
[12:56:29] hahaha
[12:56:48] so, the output of https://en.wikipedia.org/api/rest_v1/feed/featured/2020/10/02 for example does include "wikibase_item" which relates to wikidata, and I also see "description" and either local or central, I'm guessing central is wikidata
[12:56:48] so that's 3k rps of iOS traffic
[12:56:52] thus, it hits s8 :)
[12:56:56] and a bunch of it is probably synchronized to the start of the minute
[12:57:00] all apps together are 1% https://w.wiki/fZt
[12:57:04] because we've gotten too good at clocks
[12:57:11] _joe_: haha
[12:57:45] <_joe_> cdanis: yeah all that ntp nonsense
[12:57:47] While I was browsing around I also saw spikes in action=query in grafana, but couldn't dive much deeper, but I guess wikifeeds calls that, and some of the wikibase api modules there Amir1
[12:58:03] <_joe_> addshore: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=4&orgId=1&from=now-24h&to=now&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=wikifeeds&var-destination=mwapi-async
[12:58:15] <_joe_> mwapi-async is the mw api
[12:58:27] which I guess is https://en.wikipedia.org/w/api.php?action=help&recursivesubmodules=1#query+pageterms Amir1
[12:58:29] hmm addshore maybe we can coordinate with them to improve their API calls or build something suited to their needs?
[12:58:34] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[12:58:34] <_joe_> jayme: <3 for making the k8s version
[12:58:51] addshore: aaah, we don't cache pageterms
[12:58:52] Amir1: yeah, I'm not sure how much caching pageterms currently has?
[12:59:02] I wanted to do it, you didn't let me
[12:59:04] :D
[12:59:06] Amir1: there you go, so that's why the spikes end up getting all the way down to s8 :P
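Since an uncached pageterms lookup is what ties the feed traffic to s8, here is a hedged reconstruction of roughly what a single term lookup boils down to in the normalized Wikibase term store (the `wbxl_language = 'lez'` fragment quoted earlier belongs to this shape). The table and column names are the standard wbt_* schema as I understand it, not copied from the truncated query in the log, so treat this as an approximation rather than the exact query Wikibase generates.

```sql
-- Approximate shape of a label/description lookup in the Wikibase term store;
-- the real query built by Wikibase\Lib\Store\Sql\Terms differs in detail.
SELECT wbit_item_id, wby_name AS term_type, wbxl_language, wbx_text
FROM wbt_item_terms
JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
JOIN wbt_type         ON wby_id  = wbtl_type_id
JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
JOIN wbt_text         ON wbx_id  = wbxl_text_id
WHERE wbit_item_id IN (/* numeric item id from the page's wikibase_item */ 64)
  AND wbxl_language IN ('lez', 'ja');
```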
[12:59:11] Amir1: well, now we have a reason to :D
[12:59:13] * jayme peeks in
[12:59:26] Amir1: but we should confirm that they actually use / call it first
[12:59:29] <_joe_> so we can blame addshore
[12:59:31] <_joe_> great
[12:59:34] _joe_: <3
[12:59:39] <_joe_> cdanis: let's close the incident
[12:59:48] <_joe_> we found a scapegoat
[12:59:48] * addshore runs to a call
[12:59:59] the scapegoat is escaping
[13:01:01] ack
[13:01:08] I am getting a bit more coffee and then finishing writing a subtask
[13:01:20] Thanks.
[13:01:41] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[13:01:50] <_joe_> thanks cdanis :))
[13:02:04] thanks guys :*
[13:03:17] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Addshore) ` 1:57 PM While i was browsing around I also saw spikes in action=query in grafana, but couldn't dive...
[13:03:24] Amir1: I added notes from our chat about pageterms there too
[13:03:33] Thanks <3
[13:03:40] I'm trying to find their codebase
[13:04:39] I found this? https://gerrit.wikimedia.org/g/mediawiki/services/restbase/+/32951a31c32aaf92714b04b601c64546d86b599b/v1/summary.yaml#124
[13:05:06] looks like the coconut
[13:06:23] that's interesting
[13:06:53] I guess the feed thing uses this extracts thing
[13:09:47] https://github.com/wikimedia/wikifeeds I can't find anything there
[13:11:26] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[13:11:50] 10DBA, 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10Joe) To clarify a bit - restbase has hourly spikes of requests for the `feed` endpoint, which go back to wikifeeds, which c...
[13:16:17] why now?
[13:21:52] <_joe_> Amir1: I suspect that moving some composition from restbase to wikifeeds made the problem more evident
[13:23:17] <_joe_> the problem was present before btw
[13:25:57] ack
[13:46:59] Amir1: indeed, but there is this extract thing? https://github.com/wikimedia/wikifeeds/search?q=extract
[13:47:12] I'm sure it's connected, but I don't understand exactly by which code
[13:50:39] addshore: oh, the new search in commons uses the pageterms API too https://gerrit.wikimedia.org/g/mediawiki/extensions/WikibaseMediaInfo/+/bcb422d89b84f12fe9927235f65ac79e7997895d/resources/mediasearch-vue/components/SearchResults.vue
[13:52:56] Is that with the context of mediainfo though? or items? (must be mediainfo)?
[14:00:55] 10DBA, 10Epic: All sorts of random drifts in wikis in s3 - https://phabricator.wikimedia.org/T260111 (10Marostegui)
[14:56:06] https://usercontent.irccloud-cdn.com/file/OYLe48TO/image.png
[14:56:07] https://www.mediawiki.org/wiki/Topic:Vv9bnk8pad4zwree
[14:56:32] Anyone aware offhand of cases where ALTER TABLE drop index, create index; might not work properly?
[15:03:20] what do you mean Reedy? there is a bug if you use "if exists" or "if not exists"
[15:03:33] not sure if you are using it
[15:03:54] nope
[15:05:01] for the record: https://jira.mariadb.org/browse/MDEV-8351
[15:05:15] they're using oracle mysql I think
[15:05:23] but they get errors with
[15:05:24] ALTER TABLE /*_*/ipblocks DROP INDEX /*i*/ipb_address_unique, ADD UNIQUE INDEX /*i*/ipb_address_unique (ipb_address(255), ipb_user, ipb_auto);
[15:05:25] but
[15:05:35] DROP INDEX /*i*/ipb_address ON /*_*/ipblocks;
[15:05:36] CREATE UNIQUE INDEX /*i*/ipb_address_unique ON /*_*/ipblocks (ipb_address(255), ipb_user, ipb_auto);
[15:05:37] is fine
[15:06:05] oh, hang on
[15:06:22] that's not obviously equivalent
[15:06:32] but... I know the first one works fine locally... Doesn't on some other versions
[15:07:02] other mysql versions
[15:07:44] yeah works for me on mariadb 10.1, 10.3 and 10.4
[15:08:13] I can spin up a mysql 8 instance tomorrow on my lab and test if you like
[15:08:50] I'd have to check the thread to see what version they say they're using
[15:11:08] >The MySQL version is 5.7.31
[15:12:24] We've not had complaints of similar patches before
[15:17:29] marostegui: I think I couldn't see the wood for the trees here
[15:17:32] The patch is literally wrong
[15:17:35] *original
[15:17:50] It wasn't dropping the index it was supposed to, so the guarding didn't work
[15:18:34] riiiight
[15:22:04] Or...
[15:22:09] Maybe not
[15:24:07] Or maybe it is the first issue
[15:24:34] No, the patch is right
[15:24:49] while it makes no literal sense, it should be dropping ipb_address_unique and recreating it with a slightly different signature
[15:25:32] have they tried doing it in two different transactions?
[15:29:25] >...skipping update to shared table ipblocks.
[15:29:31] That's... probably the reason
[15:29:31] SIGH
[15:29:56] Ok, now I know the proper fix
[15:31:16] And why it's failing like it does
[15:32:17] why is it failing?
[15:32:29] https://github.com/wikimedia/mediawiki/commit/39448a9cf6542883a89820c7fd0481c1c79d281e#diff-00fd0e892ac6e44456a418054b1a61d9
[15:32:34] So mediawiki has a thing called shared tables
[15:32:44] and to update "shared tables" you have to pass a flag to update.php
[15:32:53] as ipblocks is a (potential) shared table
[15:33:10] this means that this following schema change isn't being done
[15:33:16] [ 'renameIndex', 'ipblocks', 'ipb_address', 'ipb_address_unique', false, 'patch-ipblocks-rename-ipb_address.sql' ],
[15:33:23] which means the ipb_address index doesn't become ipb_address_unique
[15:33:43] doFixIpbAddressUniqueIndex was added in the commit above, doesn't check if it should be updating a shared table
[15:34:07] Actually, no...
[15:34:09] wtf is going on
[15:34:15] hahahah
[15:34:19] The added indexHasField has the doTable
[15:35:04] but there are two different things here, no? 1) the sql statement failing 2) MW failing to do its magic
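For reference, the two schema-change variants from the paste above side by side, plus a manual equivalent of the indexHasField() guard being discussed. The ALTER/DROP/CREATE statements are copied from the log (the /*_*/ and /*i*/ markers are MediaWiki's table/index prefix placeholders, and as noted above the two variants drop different index names); the SHOW INDEX check is only my sketch of what the guard boils down to, not MediaWiki code.

```sql
-- Variant 1 (single statement; the form that errors for the reporter):
ALTER TABLE /*_*/ipblocks
  DROP INDEX /*i*/ipb_address_unique,
  ADD UNIQUE INDEX /*i*/ipb_address_unique (ipb_address(255), ipb_user, ipb_auto);

-- Variant 2 (two statements; the form reported as working):
DROP INDEX /*i*/ipb_address ON /*_*/ipblocks;
CREATE UNIQUE INDEX /*i*/ipb_address_unique ON /*_*/ipblocks (ipb_address(255), ipb_user, ipb_auto);

-- Manual equivalent of indexHasField('ipblocks', 'ipb_address_unique', 'ipb_anon_only'):
-- a returned row means the old index shape is still present and needs rebuilding.
SHOW INDEX FROM /*_*/ipblocks
WHERE Key_name = 'ipb_address_unique' AND Column_name = 'ipb_anon_only';
```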
[15:35:21] 1) does work on mariadb 10.1, 10.3 and 10.4 at least
[15:35:22] the sql statement is going to fail, because the index doesn't exist
[15:35:38] ah cool
[15:35:45] then that's "ok" yeah
[15:35:56] it looks like indexHasField isn't doing as expected
[15:43:38] AH
[15:43:41] FOUND YOU
[15:43:42] if ( !$this->doTable( $table ) ) {
[15:43:42] return true;
[15:43:42] }
[15:43:45] if ( $row->Column_name == $field ) {
[15:43:45] $this->output( "...index $index on table $table includes field $field.\n" );
[15:43:45] return true;
[15:43:46] }
[15:43:49] it's the wrong fscking true
[15:44:47] if ( !$this->indexHasField( 'ipblocks', 'ipb_address_unique', 'ipb_anon_only' ) ) {
[15:44:47] $this->output( "...ipb_address_unique index up-to-date.\n" );
[15:44:47] return;
[15:44:47] }
[15:45:06] We want to update the index *IF* the field exists in the index
[15:45:08] not if it doesn't
[15:45:10] 10DBA, 10Data-Services: Prepare and check storage layer for smnwiki - https://phabricator.wikimedia.org/T264900 (10Urbanecm)
[15:55:23] https://gerrit.wikimedia.org/r/632749
[15:55:24] unclear logic
[15:57:36] basically... it was giving the same value if the update was to a shared table, or the index exists
[15:57:40] and both don't mean the same thing
[15:58:10] * Reedy stab
[15:58:11] s
[19:03:05] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried)
[19:03:26] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried)
[19:04:11] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried)
[19:09:17] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried) @Marostegui Hello! We are planning to enable the feature on Persian, French, German, and Czech Wikipedia on Tuesday, October 13. You can follow the relea...
[19:11:50] 10DBA, 10Community-Tech, 10Expiring-Watchlist-Items: Watchlist Expiry: Release plan [rough schedule] - https://phabricator.wikimedia.org/T261005 (10ifried)