[00:19:49] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (Krinkle) According to , db1062 is still part of s7. Also, less frequently, I also see MW errors in Logstash about two other s7 host...
[00:27:08] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (Krinkle)
[00:28:48] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (Krinkle)
[00:48:20] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (Krinkle) From and : > db1086: 0 > db1062: 0 `name=etcd…/eqiad.json,lang=jso...
[00:48:22] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (colewhite) It seems clear that db1062 shouldn't be pooled anywhere. Ran the dbctl depool utility and it's gone from s7.
[00:55:42] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (Krinkle) Open→Resolved a:Krinkle Thanks. Continuing at T239877 about the other two, which seems unrelated at this point (they are still "normal" replica d...
[00:56:10] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Krinkle)
[05:42:40] DBA, Wikimedia-production-error: MediaWiki: "host db1062 is unreachable" (Connection refused) - https://phabricator.wikimedia.org/T239874 (Marostegui) Thanks for tackling this and apologies for not depooling it when I should've. db1062 stopped being a master a week ago {P9743} We normally depool it stra...
[05:43:44] Blocked-on-schema-change, DBA, Core Platform Team: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 (Marostegui)
[05:45:11] DBA: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 (Marostegui) Open→Resolved
[05:45:11] DBA: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 (Marostegui) All done
[05:45:26] DBA: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 (Marostegui)
[05:56:30] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Marostegui) I believe db1094 and db1136 being unreachable was the consequence and not the cause. Both hosts had a huge spike on conne...
[06:01:30] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[06:27:29] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Marostegui) I have not been able to find errors for those two hosts outside of the spikes, which I believe means that they are normall...
[06:51:19] DBA, Patch-For-Review: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2065.codfw.wmnet` - db2065.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Downtimed management...
[06:57:44] Amir1: I have merged your change
[07:13:40] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:46:20] DBA, Operations: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:07:33] DBA, Patch-For-Review: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 (Marostegui)
[08:25:55] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:29:50] marostegui: thanks!
[08:42:01] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) a:jbond→Marostegui
[09:10:47] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:18:06] DBA, Operations, Puppet, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[10:18:38] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Ladsgroup) Given the dbs (in s7) I highly doubt it's wikidata but I also want to mention that s7 has only frwiktionary and metawiki as...
[10:20:24] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Marostegui) These queries showed up at the time of the incident, and I cannot see them happening before it: {P9822} I have not been a...
[10:22:13] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Marostegui) >>! In T239877#5714834, @Ladsgroup wrote: > Given the dbs (in s7) I highly doubt it's wikidata but I also want to mention...
[10:28:23] What is db1092? API?
[10:29:15] yes
[10:29:22] s8 api
[10:29:26] https://noc.wikimedia.org/dbconfig/eqiad.json
[10:30:02] I think we need a prettier and more complete https://noc.wikimedia.org/db.php
[10:30:40] also, where is cluster s11?
[10:30:52] what is s11?
[10:31:07] that weird testlabscloudwikidb-test
[10:31:09] from cloud
[10:31:14] I thought that was s10
[10:31:19] ah no, s10 is just wikitech?
[10:31:21] s10 is wikitech
[10:31:23] yeah
[10:31:36] oh, thanks I didn't know about the new thing there
[10:31:46] I don't see it either on https://noc.wikimedia.org/dbconfig/eqiad.json
[10:32:07] it is not on the json
[10:32:10] it is hardcoded
[10:32:16] maybe that is why it doesn't show up
[10:32:18] great.....
[10:32:29] but I think it should be on the config
[10:32:38] at least commented out
[10:33:24] it is there, see: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php
[10:33:47] so s11 is not handled by etcd?
[10:33:50] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+blame/master/wmf-config/db-eqiad.php?blame=1#91
[10:34:06] no, that was done on purpose, as it is a test thingy on production
[10:34:21] note I didn't decide that, just a messenger
[10:34:25] sure
[10:34:47] but the fact that the lead dba doesn't know it is troubling :-P
[10:34:52] so s10 is handled by etcd but not s11?
[10:35:01] so it seems to me
[10:35:11] let me check s10
[10:35:27] root@cumin1001:/home/marostegui# dbctl config get | grep s10
[10:35:28] "s10": [
[10:35:28] "s10": [
[10:35:30] looks like yeah
[10:35:38] so there you have it
[10:35:46] yeah, s10 is on eqiad.json
[10:36:26] hopefully one day everything will be handled by etcd
[10:36:43] I asked chris a few days ago, and they are expecting to handle ExternalLoads section with etcd by the end of the Q
[10:37:11] I saw that
[10:37:14] look:
[10:37:17] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+blame/master/wmf-config/CommonSettings.php?blame=1#280
[10:37:40] ah right
[10:37:40] thanks
[10:42:27] btw. you know about the spikes? https://phabricator.wikimedia.org/T143870#5713514 querying parser cache says these spikes are responsible for 33% to 50% of all entries in PC. If I can fix that, I think you won't need to buy hardware for PC for years
[10:43:07] but I can't find out what's causing it, if you have any data to help, please let me know, I know API and general replicas are getting the spikes: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-6h&to=now&fullscreen&panelId=3&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104
[10:43:12] 50% of the entries? :o
[10:43:30] Amir1: if those are writes (not reads) it is very easy
[10:43:39] Amir1: I can grab the binlogs for you
[10:43:40] examining the binlog at the time of the spikes
[10:44:19] jynus: it's writes only on PC, I can't find anything in writes at mysql
[10:44:25] marostegui: that would be great
[10:44:47] Amir1: if you give me a day and an approximate interval time, I can check the binlog for the writes that happened on mysql
[10:44:50] I have been reading web requests table in hadoop, no offender stood out,
[10:45:21] Does this help? https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-24h&to=now&var-contentModel=wikibase_item&var-contentModel=wikibase_property&fullscreen&panelId=8
[10:45:33] yep
[10:45:52] I will try to get some info for you, either today or next week (public holiday tomorrow and monday)
[10:46:13] ooh, enjoy it then
[10:47:44] oh, I didn't remember about holidays
[10:47:52] jynus: you are welcome!
[11:06:50] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Krinkle) Note that s7 is metawiki + centralauth which are involved on requests for all wikis pretty much all the time. That includes g...
[11:41:37] DBA, Core Platform Team, Wikimedia-Rdbms: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Krinkle)
[11:44:48] DBA, Operations, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Krinkle)
[11:47:53] DBA, Operations, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Krinkle)
[13:00:30] Backup success ratio in the last 24 hours: 5.072%
[13:01:39] the funny thing is that it is not wrong, however backups are technically ok, with only 1 job running late (gerrit) due to overload
[13:22:59] DBA, Core Platform Team, Wikimedia-Rdbms: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Marostegui) I would definitely like to be able to give read traffic to the master. It won't happen often and in 3 years I only recall once or twice...
[15:01:06] DBA, Core Platform Team, Wikimedia-Rdbms: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Anomie) > 1. Does MediaWiki as used by WMF (wikimedia/rdbms, LBFactoryMulti) support giving the master db non-zero weight in terms of read queries...
[15:01:16] DBA, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Anomie)
[15:28:06] DBA, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (jcrespo) > Should we consider changing any of this? Not a DBA, but I think we should throw away the Mediawiki load b...
[15:34:40] is backup1001 unresponsive or is it just me?
[15:35:17] apparently only ssh
[15:41:45] yeah, it is overloaded
[15:41:51] I am going to restart it
[16:38:19] what happened in the end?
[16:38:23] (I was in an interview)
[16:53:57] restarted it
[16:54:03] disabled copy jobs
[16:54:16] I hope it doesn't overload again
[17:02:46] DBA, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Anomie) We can't entirely throw away the loadbalancer/lbfactory code, as it's also used to manage connections to diff...
[17:21:55] DBA, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Marostegui) >>! In T239900#5715511, @Anomie wrote: >> 1. Does MediaWiki as used by WMF (wikimedia/rdbms, LBFactoryMul...
[17:25:35] DBA, Operations, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Marostegui) I am not sure if I want to fully disallow weight 0 for replicas, there are some cases where we might actually want that, Cross posting from: T239900 ` We used...
[17:38:05] DBA, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (jcrespo) > we'd still probably lose the ability to reuse the same opened connection We could reuse at middleware sid...
[17:44:48] DBA, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (Anomie) >>! In T239877#5714837, @Marostegui wrote: > These queries showed up at the time of the incident, and I cannot see them happen...
[17:55:10] DBA, Operations, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (colewhite) p:Triage→Normal
[19:41:54] DBA, Wikimedia-Rdbms, Core Platform Team Workboards (Clinic Duty Team): Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs - https://phabricator.wikimedia.org/T239900 (Anomie) >>! In T239900#5716043, @Marostegui wrote: > We used to have vslow,dump with weight 0 just in case the dumps...
[19:59:47] DBA, Patch-For-Review, Wikimedia-production-error: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 (cscott) I bet that `WHERE foo=1` bypasses the index and does a scan of every single title string and converts it...