[01:27:32] DBA, MediaWiki-Database, Operations: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851 (TTO)
[02:25:13] DBA, MediaWiki-Database, Operations: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851 (GeoffreyT2000) >>! In T135851#2312924, @jcrespo wrote: > -1 disagreeing with the solution. > > This can happen also on master failover-which your solution will not protec...
[04:32:13] DBA, MediaWiki-Database, Operations: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851 (GeoffreyT2000) Open>Resolved a:GeoffreyT2000 This is more of a MySQL bug than a MediaWiki bug. Anyway, this bug was actually originally reported in 2003 as [[h...
[05:25:37] DBA, Operations, ops-codfw: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Marostegui) a:Papaul Looks like it had some memory errors: ``` /admin1/system1/logs1/log1-> show record1 properties CreationTimestamp = 20180729082203.000000-300 ElementName = System Event Log En...
[05:26:52] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (Marostegui)
[05:27:06] Blocked-on-schema-change, DBA, Wikidata, Patch-For-Review, Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (Marostegui)
[05:27:20] DBA, Patch-For-Review, Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (Marostegui)
[05:34:21] DBA, Operations, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Marostegui)
[05:34:34] DBA, Operations, Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (Marostegui)
[05:57:20] DBA, Wikimedia-General-or-Unknown: Database error while saving an article (1205 Lock wait exceeded in Title::invalidateCache) - https://phabricator.wikimedia.org/T133185 (Marostegui) Open>Resolved I am going to close this as resolved as it looked a one time thing, and as specified here T133185#22...
[06:27:05] On 10.3 :)
[06:27:08] *Oh
[06:27:44] I am going to test it on old sanitariums
[06:28:13] it is larger because it includes rocksdb and mariabackup
[06:28:41] Good opportunity to test mariabackup!
[06:28:57] I tested locally and seemed fine
[07:54:35] DBA, Patch-For-Review: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376 (Marostegui)
[07:54:37] DBA, Patch-For-Review: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376 (Marostegui) db1120 is now serving traffic in x1
[08:37:55] DBA, Data-Services, Datasets-General-or-Unknown: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 (ArielGlenn)
[08:46:56] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (Marostegui)
[08:46:58] Blocked-on-schema-change, DBA, Wikidata, Patch-For-Review, Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (Marostegui)
[08:47:12] DBA, Patch-For-Review, Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (Marostegui)
[09:34:19] DBA, Operations, ops-codfw: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Peachey88)
[10:11:31] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/449141/ I would appreciate a review of that before tomorrow (it is the row B depools for the network maintenance that is happening tomorrow)
[11:15:48] Have you noticed any reduction on loads of vslow nodes?
[11:20:34] Amir1: could you be more specific?
[11:20:54] in time or resources?
[11:21:34] jynus: I just deployed a change (that was deployed on wikidata last week) which removes very long-running queries from vslow nodes.
[11:21:53] I want to see if it helped with anything tangible
[11:22:50] I can check, but normally metrics take some time to notice due to variability
[11:31:47] jynus: so the change got deployed on wikidata, enwiki, dewiki, and commons at 16:49 UTC 2018-07-25.
If that can be checked now
[11:33:32] oh, 25 is enough time indeed
[11:33:58] jynus: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1106&var-port=9104&from=now-30d&to=now
[11:34:00] Wow
[11:34:26] lol
[11:34:58] s8 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1087&var-port=9104&from=now-30d&to=now
[11:35:46] do you have the patch handy, I would like to read it
[11:35:48] ?
[11:35:57] s5 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&from=now-30d&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1113&var-port=13315
[11:36:01] sure
[11:36:11] This is the config: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/447834/1/wmf-config/InitialiseSettings.php
[11:36:16] not very useful ofc
[11:36:32] oh, change tag?
[11:36:56] certainly it seemed it wasn't scaling
[11:37:02] jynus: This is the core change: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/445179
[11:37:10] although I would wait until today to confirm
[11:37:25] as weekends have different patterns
[11:37:43] jynus: Load time of special:tags in wikidata went from one minute to one second
[11:37:54] oh, I knew the issue
[11:38:07] I think it was the one that they said they were caching
[11:38:14] but the uncached queries had the same issue
[11:38:21] stats gathering or something
[11:38:27] yup
[11:38:31] if it is not that, a very similar issue
[11:38:56] jynus: I have been working on this in the past couple of months, the plan is to basically drop ct_tag which would save a lot of space and probably all of tag_summary table altogether
[11:39:05] Amir1: you rule
[11:39:14] you know that?
[11:39:18] using the normalized version of it called change_tag_def
[11:39:34] that is great, I just forget about that after a few weeks
[11:39:36] I'm literally blushing right now
[11:39:57] you are literally saving thousands of dollars
[11:40:13] and/or allowing other functionality to run there when needed
[11:40:45] thank you
[11:41:04] <3 Thanks. I hope I can help more
[11:41:12] you are
[11:58:06] s7: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1090&var-port=13317
[11:58:17] Guess when the change got deployed
[12:26:06] Amir1: one thing I see on the negative side, however is s8 reads increase:
[12:26:22] it could be just maintenance
[12:26:46] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=slave&from=1531161957276&to=1532953533360
[12:27:18] but has been on for 7 days
[12:28:10] wait, it may not be s8
[12:29:44] it is s8, starting 2018-07-23 at 9:00 UTC
[12:29:54] very stable
[12:30:37] 09:00 Amir1: start of ladsgroup@mwmaint1001:~$ foreachwikiindblist s8 populateChangeTagDef.php
[12:30:44] so I think it is expected
[12:30:53] yup, it'll be done in 30 days
[12:30:53] so no worries
[12:31:09] cool, I just was worried it would be a query not using indexes
[12:31:17] maintenance has no issues
[12:31:20] because it is temporary
[12:31:27] so all good
[12:31:29] it uses the new change_tag_def table that has only 100 rows :D
[12:31:36] all cool
[12:31:39] thanks
[12:31:45] \o/
[12:31:56] I should have automated alerts
[12:32:00] for those
[12:32:16] so query pattern changes have explanation
[12:33:00] It will eventually be done as part of T172492
[12:33:00] T172492: Improve database alerting (tracking) - https://phabricator.wikimedia.org/T172492
[12:42:23] FYI if you have some time today I'd appreciate a review of this https://gerrit.wikimedia.org/r/c/operations/puppet/+/449175
[12:42:29] should be straightforward enough
[12:42:38] I was looking at it
[12:42:47] I wonder why was that on haproxy module?
[12:43:28] I think it was like that since almost the beginning, histerical raisins
[12:43:37] hysterical? LOL
[12:43:51] ah right, yes hysterical raisins
[12:44:35] and it is not in use by the default server, for what I see?
[12:45:58] it is, but it is overridden everywhere, I guess
[12:46:22] let me double check in a second other classes
[12:46:27] godog: for completeness
[12:46:41] s/classes/roles+profiles/
[12:46:49] and will +1 in a second
[12:47:00] jynus: for sure
[12:47:23] I wouldn't be surprised analytics or someone else use it
[12:48:06] I've just also spotted an unrelated bug
[12:48:28] I checked but it doesn't seem so
[12:48:29] # cumin C:haproxy
[12:48:29] 12 hosts will be targeted:
[12:48:29] dbproxy[1001-1011].eqiad.wmnet,thumbor2001.codfw.wmnet
[12:50:40] ah, cool
[12:50:48] I wanted to do grep
[12:50:59] in case there is an existing but unused class
[12:51:17] is thumbor2001 ok?
[12:51:28] yeah that's me, I'm deploying haproxy there
[12:51:36] that's how I found this bug
[12:51:36] this is the bug https://gerrit.wikimedia.org/r/#/c/449179/
[12:51:42] (a different one)
[12:52:46] ack, ok to merge # cumin C:haproxy
[12:52:47] 12 hosts will be targeted:
[12:52:47] dbproxy[1001-1011].eqiad.wmnet,thumbor2001.codfw.wmnet
[12:52:52] nope
[12:52:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/449175
[12:54:50] looks good to merge?
[12:54:59] and another bug
[12:55:23] https://gerrit.wikimedia.org/r/449180
[12:55:50] everything else is ok
[12:56:39] we may want to review the default configuration
[12:56:44] but for another time
[12:56:55] the default one should be very vanilla
[12:57:15] but hope it works for now for you
[12:57:31] also it is great you pinged us because at the moment, we thought we were the only users
[12:57:40] so we changed it happily
[12:57:49] heheh
[12:58:22] yeah we're trying it out for thumbor, see if it does better than nginx
[12:58:34] I'll take a look at your patches once I'm done with this one
[12:58:45] I am adding you to 449179
[12:58:48] but no rush
[12:59:30] we can move it to tmpfiles better
[12:59:57] but the rule nowadays is to never, ever use /tmp
[13:00:09] (which used to be the default)
[13:00:49] also, 2 things- we don't use the http admin port
[13:00:57] ask if you need help to use the socket
[13:01:09] and we should deploy the haproxy prometheus exporter at some point :-)
[13:01:26] heh I was about to say, we'd need the prometheus exporter for thumbor too for sure
[13:01:34] the haproxy one
[13:01:43] or specifically for the app?
[13:01:51] yes, haproxy exporter deployed for thumbor
[13:01:56] s/for/on/
[13:02:01] it exists, but it is not packaged, IIRC
[13:02:19] it is on the large pile of TODOs for me
[13:02:47] maybe with the help of a good DD, do you know someone?
[13:02:50] :-D
[13:02:55] good ones? no :P
[13:02:58] LOL
[13:03:29] seriously tho, I'll mention it to Gilles too since he's working on haproxy+thumbor
[13:03:36] I can help too of course
[13:15:47] sorry de default config was unmaintained, we only used dbproxy role, so never bothered to handle that
[13:15:49] *the
[13:16:12] np, I think with https://gerrit.wikimedia.org/r/c/operations/puppet/+/449183 we're good to go
[13:16:47] don't need to block on me for that, as we already knew that the only other role using it uses its own config
[13:17:01] ok!
[13:17:34] at least I parametrized socket and pid
[13:17:42] when I had to migrate away from tmp
[13:18:04] AND fixed jessie and stretch compatibility issues
[13:18:12] 1.7 deprecated lots of options
[13:18:27] so you got some things for free
[13:20:59] heheh indeed
[13:21:27] there's also the big question of providing a default for every user in puppet or everyone does their default
[13:21:48] easier to break stuff in the former of course
[13:24:57] yeah, that is a philosophy thing- I do it anyway
[13:25:28] there is also some weird stuff, like supporting several files on config, which was easy with init.d but a nightmare to support with systemd
[13:25:46] (not systemd's issue, but haproxy, which is weird)
[13:26:19] and that means now that if you change the config file path, you cannot reload, only restart
[13:28:32] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (Marostegui) s8 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1109 [] db1104 [] db1101 [] db1099 [] db1092 [] db1087 [] db1071
[13:28:33] DBA, Patch-For-Review, Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (Marostegui) s8 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1109 [] db1104 [] db1101 [] db1099 [] db1092 [] db1087 [] db1071
[13:28:36] Blocked-on-schema-change, DBA, Wikidata, Patch-For-Review, Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (Marostegui) s8 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1002 [] db1124 [] db1109 [] db1104 [] db1101 [] db1099 [] db...
[13:28:50] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (Marostegui)
[13:29:06] Blocked-on-schema-change, DBA, Wikidata, Patch-For-Review, Schema-change: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010 (Marostegui)
[13:29:08] DBA, Patch-For-Review, Schema-change: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368 (Marostegui)
[14:03:19] DBA, Operations, ops-codfw: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Marostegui) p:Triage>Normal
[14:27:21] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (mark) I am a bit confused by this RFC/proposal as it stands now, as I feel it doesn't really reflect the discussion...
[14:28:02] DBA, Operations, ops-codfw: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (jcrespo) There was already a BIOS upgrade at T139714, I would contact directly support as suggested by robh here: https://phabricator.wikimedia.org/T139283#2430289
[14:33:32] DBA, Operations, ops-codfw: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Marostegui) That sounds good to me. This is a racadm getsel so it can be sent to support: ``` /admin1-> racadm getsel Record: 1 Date/Time: Source: system Severity: Ok Descripti...
[15:06:55] Amir1: Not sure if this still affects you, but tomorrow we will be depooling two hosts for s8, for a network maintenance: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/449141/3/wmf-config/db-eqiad.php
[15:07:19] That will happen tomorrow
[15:11:23] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Harej) Thank you for the summary, @mark. I am interested in this perspective that we can get the same user experien...
[15:15:20] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Halfak) @mark, I think you have raised some good points here. I think there is a point of confusion around the "hop...
[15:29:43] marostegui: I'm still running the script on s8, should I stop it when you depool the nodes? how long does it take?
[15:30:27] The network maintenance? it is scheduled for 4 hours
[15:30:34] But I will be depooling the hosts early in the morning
[15:30:36] So let's say 24h
[15:30:39] just to be safe
[15:33:27] hmm, that's pretty long :/
[15:34:27] I want to depool the nodes early in the morning so the sections start to get used to serve the traffic with less hosts, so if there are load issues we can see them before the network maintenance
[15:34:43] Rather than depooling a bunch of hosts right before the maintenance and realising that sections struggle :)
[15:34:44] I'll stop the script early in the morning tomorrow. It's basically because of the load.
It reads a lot
[15:34:55] Then yeah, better to stop it
[15:35:39] marostegui: I think you can stop it for me, I have two screens running on mwmaint1001 one of them is on wikidatawiki and the other one is enwiki
[15:35:52] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Marostegui) >>! In T200297#4461793, @Halfak wrote: > > Past evidence (e.g. see Flow) suggests that it is not reali...
[15:35:59] Amir1: let me see
[15:38:48] Amir1: how can I know which one is enwiki and which one is wikidata?
[15:39:01] The queries do not give me any hint :)
[15:39:37] hmm, let me find a way
[15:40:07] enwiki: Updating ct_tag_id = 4 up to row ct_id = 32774501
[15:40:08] i found it
[15:40:10] yeah
[15:40:21] screen was cutting off some lines
[15:40:40] 29358.pts-3.mwmaint1001 -> that one is wikidata
[15:40:48] so tomorrow I can just ctrl+c it?
[15:41:12] marostegui: yup
[15:41:31] Great, you want me to !log it tomorrow? If so, which task?
[15:41:57] marostegui: This: https://phabricator.wikimedia.org/T193873
[15:42:06] great!
[15:42:08] Will do then :)
[15:42:12] Thank you for your understanding
[15:42:19] better be safe :)
[15:42:25] The task is actually not very correct because the table is already populated, this one populates ct_tag_id in change_tag table but meh
[15:42:32] sure thing
[15:54:58] marostegui: You are depooling nodes from s1 too, do you want to stop the enwiki as well?
[15:55:12] enwiki will finish soon-ish (probably in five days or so)
[15:55:15] yeah, I will stop both
[15:55:28] kk
[17:10:54] I'm deleting lots of rows from ores_classification of enwiki
[17:11:05] Already in SAL but FYI
[17:11:42] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) Hi, it's great to see the activity on this RFC, thank you all for the input. To expand on our answer about...
[17:12:23] https://phabricator.wikimedia.org/T200680
[17:21:02] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) @Marostegui, I'd like to explore how we might be able to use x1 for storage. We aren't currently consideri...
[17:38:04] Blocked-on-schema-change, DBA, Patch-For-Review, Schema-change: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190 (Marostegui)
[18:02:04] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) There's another interesting point here, from @mark, > According to T196547 there seems to be the expectatio...
[18:05:08] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) I want to explicitly ask something that has come up among our team: Can we agree on guidelines for new cont...
[19:25:57] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (jcrespo) > Can we agree on guidelines for new content extensions, so that nobody needs to go through this discussio...
[20:17:51] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (Halfak) @Jcrespo, I can't find the policy you are referencing. Can you link to the specific lines that apply here?...
[21:56:05] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) @jcrespo Thanks for the good tips, I agree with most of what you say. FWIW, I was working off of the exten...
[23:59:06] DBA, JADE, Operations, Scoring-platform-team, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) @mark I think you're right that our future potential use cases involving bulk bots should be dropped and re...