[00:19:46] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3850067 (10Cmjohnson)
[00:20:10] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3837888 (10Cmjohnson) @Marostegui These are ready for installs.
[06:32:08] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10Patch-For-Review: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252#3850436 (10jcrespo) p:05Triage>03Lowest No one is probably going to work on this any time soon, desp...
[06:38:52] <marostegui>	 jynus: I saw you are also logged into dbstore1001 - so let's coordinate to fix replication just in case :-)
[06:39:15] <marostegui>	 just in case one deletes the row, the other starts replication and then we also delete the new row XD
[06:43:23] <marostegui>	 I see you are deleting the duplicates, I will hold here
[06:49:39] <marostegui>	 how many you had to delete in the end?
[06:49:46] <jynus>	 >183 rows
[06:49:53] <marostegui>	 oh my...
[06:49:55] <jynus>	 sorry, I diddn't see your comment
[06:50:14] <marostegui>	 no no, no worries, I was "following" you on the logs, I didn't want both of us deleting rows, so I held here :)
[06:51:25] <jynus>	 REPLACE /* WatchedItemStore::duplicateEntry  */ INTO `watchlist` is also not a safe statement
[06:51:47] <jynus>	 it has unintended effects
[06:52:11] <jynus>	 things will likely break again
[07:32:08] <jynus>	 dbproxy1004 identifies db1107 as down
[07:32:40] <marostegui>	 grants maybe?
[07:32:48] <marostegui>	 fw?
[07:32:53] <jynus>	 no, I reloaded it and it is now up
[07:34:50] <jynus>	 maybe some strange race condition on start (it was just restarted)
[07:35:18] <jynus>	 anyway, it was the passive proxy, so no big deal
[07:37:16] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850501 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1005.eqiad.wmnet'] ``` The log can be found in `/var/...
[07:57:07] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850504 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1007.eqiad.wmnet'] ``` The log can be found in `/var/...
[08:19:39] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850525 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1005.eqiad.wmnet'] ```  and were **ALL** successful.
[08:25:22] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850530 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1008.eqiad.wmnet'] ``` The log can be found in `/var/...
[08:26:42] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850533 (10jcrespo) All proxies reimaged except the active ones: ``` dbproxy1002.eqiad.wmnet dbproxy1003.eqiad.wmnet dbproxy1006.eqiad.wmnet dbproxy1009.eqiad.wmnet dbproxy1010.eq...
[08:31:48] <jynus>	 I don't know how we will upgrade dbproxy1009 and 10, as we have not way to failover those. I guess we will point both temporarilly to the same proxy
[08:32:08] <marostegui>	 yeah, that is a good temporary solution
[08:32:19] <marostegui>	 there is not much more we can do apart from that
[08:33:26] <jynus>	 we can setup analytics as load balancing betweehn more than 1 server to handle the extra load
[08:33:40] <marostegui>	 analytics?
[08:35:07] <jynus>	 I say analytics, but I mean both server
[08:35:11] <jynus>	 *services
[08:35:41] <jynus>	 but analytics i the one that will be worrying if both services are served by a single host
[08:37:05] <marostegui>	 but this is only to reimage one proxy, no?
[08:40:21] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850573 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1007.eqiad.wmnet'] ```  and were **ALL** successful.
[08:42:57] <wikibugs>	 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#3850580 (10Marostegui)
[08:42:58] <jynus>	 marostegui: it happened again
[08:43:17] <marostegui>	 must be a race condition then?
[08:43:20] <marostegui>	 than the first check fails?
[08:43:22] <jynus>	 dbproxy1007 set its master db as down  on start
[08:43:28] <jynus>	 that is important
[08:43:38] <jynus>	 as maybe it happens on start every time
[08:43:45] <marostegui>	 maybe the network comes up AFTER haproxy?
[08:43:52] <jynus>	 probably both checks fail
[08:43:54] <jynus>	 yeah
[08:44:05] <jynus>	 but only the secondary is retried
[08:44:11] <jynus>	 because flapping
[08:44:41] <jynus>	 we could test it- we have spare proxies to reboot them
[08:44:49] <marostegui>	 yeah, we should
[08:44:56] <marostegui>	 At least to know if it is "normal"
[08:44:57] <jynus>	 but the systemd is from upstream
[08:45:05] <marostegui>	 as in, not good, but at least we know what to expect
[08:45:13] <jynus>	 so far the reboots are after reimage
[08:45:27] <jynus>	 so it would be nice to know if it happens after a normal boot
[08:45:30] <marostegui>	 does it happen with a normal reboot?
[08:45:37] <marostegui>	 ah right, you've not tested it
[08:45:50] <jynus>	 After=network.target syslog.service
[08:46:05] <jynus>	 could it be the firewall having a race condition?
[08:46:19] <marostegui>	 what about after iptables?
[08:46:38] <jynus>	 didn't you read me or I am confused?
[08:47:24] <marostegui>	 yeah, I was wondering if maybe iptables comes in a different service than network
[08:47:30] <marostegui>	 or if network brings up the whole thing
[08:48:33] <jynus>	 actually, no, the systemd unit is hand-made
[08:48:52] <jynus>	 we could add extra dependencies or a delay
[08:49:17] <marostegui>	 we could test if a delay makes any difference
[08:49:24] <marostegui>	 to give network time to bring everything up (if that makes any sense)
[08:59:28] <jynus>	 interesting
[08:59:37] <jynus>	 I may have found a bug
[09:00:04] <marostegui>	 ?
[09:02:32] <jynus>	 debian packages are horrible
[09:03:00] <jynus>	 hardcoding paths
[09:08:32] <wikibugs>	 10DBA, 10Operations, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3850615 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1008.eqiad.wmnet'] ```  and were **ALL** successful.
[09:10:59] <jynus>	 I don't thing https://gerrit.wikimedia.org/r/399359 is the problem
[09:11:02] <jynus>	 *think
[09:11:17] <jynus>	 but it was a bug
[09:33:42] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3850655 (10Marostegui) s1 is all done but the master. I will alter the master after Christmas.
[09:34:05] <wikibugs>	 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3850656 (10Marostegui)
[09:34:31] <jynus>	 I don't think that was the bug, but it would make sure that it was exactly the same systemd than the original package, but without hardcoded paths
[09:35:14] <jynus>	 I will not restart dbproxy1001 to see if the problem keeps happening on a normal reboot
[09:36:12] <marostegui>	 cool
[09:36:39] <jynus>	 s/not/now/
[09:55:31] <jynus>	 there is a chance there was some pid weirdness, but I doubt it
[10:01:12] <jynus>	          Starting HAProxy Load Balancer...
[10:01:20] <jynus>	 [  OK  ] Started HAProxy Load Balancer.
[10:01:28] <jynus>	 [  OK  ] Started ferm firewall configuration.
[10:01:51] <marostegui>	 aha!
[10:01:54] <marostegui>	 so it could be that
[10:02:08] <jynus>	 most definitifely that
[10:02:21] <jynus>	 mariadb,db1016,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN
[10:02:26] <marostegui>	 there we go!
[10:03:11] <jynus>	 is it really an haproxy bug or a ferm bug?
[10:03:30] <marostegui>	 well, none, right, they are doing their thing correctly, it is just the order
[10:04:53] <jynus>	 what I mean is
[10:05:00] <jynus>	 moritzm: do you have a minute?
[10:06:10] <moritzm>	 sure, in a few minutes, currently unbreaking puppet runs on labservices
[10:08:54] <moritzm>	 k, done
[10:11:01] <jynus>	 our thesis is that ferm blocks the network for some seconds on start
[10:11:43] <jynus>	 but I set After=network.target syslog.service so I expect network to be fully up to start my service
[10:12:02] <marostegui>	 but the ferm service comes with the network one?
[10:12:12] <moritzm>	 is that jessie or stretch?
[10:12:12] <jynus>	 yeah, I would expect so
[10:12:16] <jynus>	 stretch
[10:12:34] <jynus>	 so, we can workaround that problem
[10:13:05] <moritzm>	 so in stretch ferm changed a bit:
[10:13:06] <jynus>	 but probably is is something that ferm (or all services using it) should try to solve
[10:14:14] <moritzm>	 https://phabricator.wikimedia.org/T166653 is the task we ran into with our first stretch install
[10:14:30] <moritzm>	 but let read backscroll first
[10:14:34] <moritzm>	 but let me read backscroll first
[10:14:53] <jynus>	 moritzm: the specific issue is that the proxy failsover on start
[10:15:18] <jynus>	 which is dangerous if it is an uncontroled crash or something
[10:16:19] <jynus>	 moritzm: I do not need you you have a deeper look, I just want you to give you a heads up, and if you think it is a ferm problem, or a problem around ferm so I can file a task
[10:18:09] <jynus>	 "This makes ferm start 1-1.5 seconds later" could be an issue indeed, but I don't see how ferm starting could make outgoing connections fail
[10:19:54] <moritzm>	 I think the best solution forward would be to find a way to avoid resolve() statements in our ferm rules (and maybe have puppet resolve these for us), then we could switch back to Debian defaults and start ferm in early boot
[10:20:23] <moritzm>	 I wouldn't expect a problem with outgoing connections, no
[10:20:24] <jynus>	 that seems doable
[10:20:36] <jynus>	 a ruby function that resolves on the puppet master?
[10:20:48] <moritzm>	 I think so
[10:21:03] <jynus>	 ok, but if you do not expect outgoing connections problems
[10:21:07] <jynus>	 maybe it is not ferm
[10:21:24] <moritzm>	 yeah, it's an unrelated issue we should resolve I guess
[10:21:47] <jynus>	 well, intput DROP would drop returning queries even if related
[10:22:14] <jynus>	 is kernel in ACCEPT by default?
[10:23:02] <jynus>	 the thing is, the only difference between when it worked and now is the deployment of the firewall
[10:23:33] <jynus>	 I guess I can add an ExecPre=sleep 10 ?
[10:24:03] <moritzm>	 ah, that's a good point. haproxy probably establishes long running connections, and changing the input queue will also affect the response packets
[10:24:20] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10Patch-For-Review: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252#3850772 (10Addshore) @daniel I would mark this as blocking one of the MCR related tickets but i'm not re...
[10:24:25] <moritzm>	 jynus: sleep 10 sounds good to a test
[10:24:40] <jynus>	 but it is not very elegant
[10:24:49] <jynus>	 again, no issue with haproxy fix
[10:25:00] <jynus>	 my fear is this could happen with other services
[10:25:01] <moritzm>	 (needs /bin/sleep, though for the Exec statment)
[10:25:05] <jynus>	 yes, sure
[10:25:14] <jynus>	 I would like to open a ticket
[10:25:19] <jynus>	 even if we do not act on it
[10:25:27] <jynus>	 as a reminder
[10:25:35] <jynus>	 or very very low priority
[10:25:36] <moritzm>	 the ferm resolution is an ongoing issue we should definitely fix
[10:25:57] <jynus>	 you think starting ferm earlier should fix this problem?
[10:26:00] <moritzm>	 might be worth discussiing with a few more people during the Ops offsite after allhands
[10:26:07] <moritzm>	 yeah, that should fix it
[10:26:15] <jynus>	 if yes, I would create a followup to the ticket you mentioned
[10:26:24] <moritzm>	 but we need to find a way to deal with the existing resolve() statements
[10:26:28] <jynus>	 sure
[10:26:38] <moritzm>	 I think we have a general ticket already, let me search for it
[10:26:44] <jynus>	 ok
[10:27:00] <jynus>	 I just didn't want to hack it without a reminder somewhere
[10:27:13] <moritzm>	 https://phabricator.wikimedia.org/T148986 is the same problem space, just a different facet you're seeing
[10:28:01] <jynus>	 ok, thanks, I wil comment on that
[10:41:45] <jynus>	 mariadb,db1016,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
[10:41:55] <jynus>	 but I do not like that sleep :-(
[10:42:21] <jynus>	 it means restarts now take 10 seconds
[10:51:57] <moritzm>	 FYI, upgrading openssl across the fleet, for db servers I'll only do the usual restarts of common low level services like sshd or nrpe)
[10:57:16] <jynus>	 ok
[11:02:36] <jynus>	 marostegui: there are grants missing on db1103
[11:02:42] <jynus>	 maybe others
[11:03:06] <marostegui>	 uh?
[11:03:11] <marostegui>	 which grants?
[11:03:15] <marostegui>	 production?
[11:03:44] <jynus>	 SELECT command denied to user '(user)'@'(silver ip)' for table 'heartbeat' (10.64.0.164:3314)
[11:04:30] <jynus>	 yes, s4 production, from silver
[11:05:37] <marostegui>	 I can take a look later if you like
[11:05:52] <jynus>	 sure, not in a hurry
[11:08:06] <jynus>	 https://logstash.wikimedia.org/goto/aebb12df5a0978a78c07fa5c2cf15bc6
[11:08:27] <marostegui>	 will take a look later, busy with all the archeology mess
[12:46:55] <Amir1>	 Hey, today we got lots database error https://phabricator.wikimedia.org/T183341
[12:47:10] <Amir1>	 https://phabricator.wikimedia.org/T183341
[12:47:16] <Amir1>	 https://logstash.wikimedia.org/goto/774037cfc17c66eae97a13727bd5a271
[12:47:32] <Amir1>	 There is lots of Error: 1205 Lock wait timeout exceeded; try restarting transaction
[12:47:45] <Amir1>	 I see you depooled some s5 hosts, can this be related?
[12:47:57] <marostegui>	 shouldn't be
[12:49:33] <marostegui>	 I see it is s5 master
[12:49:44] <marostegui>	 it was
[12:50:53] <wikibugs>	 10DBA, 10Wikidata: New item fails (Special and WEF tool) - https://phabricator.wikimedia.org/T183341#3851033 (10Ladsgroup) I see lots of database errors like [[https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.12.20/mediawiki?id=AWBz8Q-dOBGWx1mQuIRx&_g=() |this]] in logstash: ``` A databa...
[12:50:59] <marostegui>	 https://grafana.wikimedia.org/dashboard/db/mysql?panelId=24&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1070&var-port=9104&from=1513760055021&to=1513764234612
[12:51:59] <wikibugs>	 10DBA, 10Wikidata: New item fails (Special and WEF tool) - https://phabricator.wikimedia.org/T183341#3850657 (10Marostegui) >>! In T183341#3851033, @Ladsgroup wrote: > I see lots of database errors like [[https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.12.20/mediawiki?id=AWBz8Q-dOBGWx1m...
[12:54:37] <wikibugs>	 10DBA, 10Wikidata: New item fails (Special and WEF tool) - https://phabricator.wikimedia.org/T183341#3851061 (10Billinghurst) if this is something that may recur, may I ask for the coordination of a more informative error message, or something that tells me to stop and come back soon?
[12:55:34] <wikibugs>	 10DBA, 10Wikidata: New item fails (Special and WEF tool) - https://phabricator.wikimedia.org/T183341#3851064 (10Marostegui) The host that was complaining was 10.64.48.25 which is s5 master. I was doing some SELECTs on the master as part of T161294 (which roughly matches those times). However, those SELECTs wer...
[12:57:30] <wikibugs>	 10DBA, 10Wikidata: New item fails (Special and WEF tool) - https://phabricator.wikimedia.org/T183341#3851077 (10Marostegui) >>! In T183341#3851061, @Billinghurst wrote: > if this is something that may recur, may I ask for the coordination of a more informative error message, or something that tells me to stop...
[13:07:37] <wikibugs>	 10DBA, 10Wikidata: New item fails (Special and WEF tool) - https://phabricator.wikimedia.org/T183341#3850657 (10jcrespo) The `LIMIT 1   FOR UPDATE` (plus what marostegui comments) indicates that is not a lag problem, but a contention problem (Error: 1205 Lock wait timeout exceeded)- many items wanting to lock...
[13:28:41] <wikibugs>	 10DBA, 10Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3851221 (10elukey)
[13:30:41] <wikibugs>	 10DBA, 10Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3363866 (10elukey) Logged activity on the parent task instead of this one :)  https://gerrit.wikimedia.org/r/398869 https://gerrit.wikimedia.org/r/399149 https://gerrit.wikimedia.org/r/399153  Men...
[13:31:30] <wikibugs>	 10DBA, 10Analytics-Kanban: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3363866 (10elukey) a:03elukey
[13:35:17] <elukey>	 so --^ should be completed after the first run of the sanitization script (in progress), then it will run daily as it already happens for the slave
[13:35:40] <elukey>	 this seems the last task, so after it is done we can close the parent one too
[14:33:59] <jynus>	 BTW, there was some slowdown on the consuming events- not affecting the db at all
[14:34:12] <jynus>	 but not sure if you were aware of it
[14:56:02] <jynus>	 can I add db1111 and 2 to tendril?
[15:37:53] <jynus>	 I am performing a partial backup of tendril db, not sure disks like that
[15:54:38] <marostegui>	 no, they are not on tendril
[15:54:55] <marostegui>	 I didn't put them there because I didn't want them to distract us
[15:55:14] <marostegui>	 feel free to add them yep
[15:55:27] <marostegui>	 I didn't want to add them until they were set up, to avoid having two servers in red there
[15:56:25] <jynus>	 ups https://tendril.wikimedia.org/tree
[15:58:17] <marostegui>	 hehe that's cool, they are set up already, although they are supposed to be s4 not s5 :)
[15:58:38] <jynus>	 hence the "ups"
[16:25:28] <wikibugs>	 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3851673 (10alanajjar) @Marostegui  We should wait on preform this request, because there's a large discussion about this user and another sysop who blocked him -incorrectly-...
[16:26:07] <wikibugs>	 10DBA, 10Wikimedia-Site-requests: Global rename of Makki98 → Bluemoon2999: supervision needed - https://phabricator.wikimedia.org/T183285#3851675 (10Marostegui) Cool thanks for the heads up
[17:29:43] <no_justification>	 jynus: Any chance you still need https://gerrit.wikimedia.org/r/#/c/338996/? I see your comment about it being a template, but since the related task was closed thought I'd check (doing a little wmf-config backlog grooming today)
[17:30:51] <no_justification>	 (if you still need it that's totally cool, I'm just checking :))
[17:50:26] <wikibugs>	 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3533347 (10MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[17:51:42] <wikibugs>	 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3572381 (10MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[17:52:13] <wikibugs>	 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3798986 (10MoritzMuehlenhoff) This host still shows up in puppetdb, i.e. misses the deactivate step (e.g. visible in https://servermon.wikimedia.org/hosts/)
[18:19:01] <wikibugs>	 10DBA, 10MediaWiki-extensions-Newsletter, 10Google-Code-in-2017, 10Patch-For-Review, 10Performance: List of Newsletters should have a column showing the number of issues - https://phabricator.wikimedia.org/T180979#3852097 (10Florian) Ok, during implementation and after chatting to @Bawolff about that, we...
[18:27:37] <RoanKattouw>	 So I'm reconstructing the replica on deployment-db04 and guess what I'm seeing
[18:27:45] <RoanKattouw>	 There's an impressive data set on beta labs
[18:27:48] <RoanKattouw>	 3 million edits
[18:27:54] <RoanKattouw>	 But it's on/from simplewiki
[18:28:10] <RoanKattouw>	 beta simplewikis is 10x the size of beta enwiki
[18:29:16] <RoanKattouw>	 https://simple.wikipedia.beta.wmflabs.org/wiki/Special:Statistics shows 232k pages, 3M edits, en is 152k/333k and most of that is probably browser test noies
[18:30:02] <RoanKattouw>	 I mean there's like complete pages with good content, templates, images, the works, e.g. https://simple.wikipedia.beta.wmflabs.org/wiki/Blackpool_tramway
[18:30:06] <RoanKattouw>	 I feel like I've just found a hidden treasure
[18:41:43] * twentyafterfour o/
[18:42:50] <greg-g>	 RoanKattouw: I believe we imported all of simplewiki at some point
[18:43:05] <RoanKattouw>	 Yeah I suspected as mch
[18:43:21] <RoanKattouw>	 You wouldn't get this kind of fidelity without a full import
[18:44:00] <twentyafterfour>	 RoanKattouw: or a lot of people got confused and they've been editing the wrong wiki this whole time
[18:45:49] <no_justification>	 Nah, we imported simplewiki
[18:45:52] <no_justification>	 :)
[18:48:25] <RoanKattouw>	 haha
[18:48:29] <RoanKattouw>	 Yeah that's great to see actually
[18:48:40] <RoanKattouw>	 People have been complaining that there isn't good data on the beta cluster
[18:48:45] <RoanKattouw>	 There is! It's just not where people think to look
[18:48:48] <Reedy>	 lol
[18:48:56] <Reedy>	 If only people could edit the wikis....
[18:49:35] <twentyafterfour>	 so is https://phabricator.wikimedia.org/ ready for replication restart?
[18:50:04] <twentyafterfour>	 I can try to fix the replicas if the masters are consistent then I think I know what to do
[18:50:24] <RoanKattouw>	 Wait, phab had DB issues too?
[18:50:29] <RoanKattouw>	 I'm just finishing that process for the beta wikis
[18:50:39] <twentyafterfour>	 sorry wrong link
[18:50:43] <twentyafterfour>	 T183252
[18:50:43] <stashbot>	 T183252: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252
[18:50:56] <twentyafterfour>	 I meant phabricator.wikimedia.org/T183252  heh
[18:51:47] <RoanKattouw>	 Oh yeah I'm just finishing that up
[18:51:53] <RoanKattouw>	 I think I've brought replication back on line
[18:51:59] <twentyafterfour>	 oh nice
[18:52:11] <RoanKattouw>	 See my log entries over in #wikimedia-cloud (and in the deployment-prep SAL)
[18:52:22] <RoanKattouw>	 I had to export the entire DB on db03 and import it on db04
[18:52:31] <RoanKattouw>	 Which failed initially because of a broken view that I had to delete
[18:52:40] <RoanKattouw>	 I'm now testing if the replication works correctly
[18:53:01] <twentyafterfour>	 tread carefully lest you get the unwanted job title "beta dba"
[18:53:11] <twentyafterfour>	 :-o
[18:53:32] <RoanKattouw>	 And it looks like it does
[18:53:34] <RoanKattouw>	 haha
[18:53:48] <RoanKattouw>	 Well someone had to fix it, and neither the DBA team nor the releng team seemed to want to
[18:54:29] <twentyafterfour>	 RoanKattouw: well thanks for your valient effort. I was recruited just a bit too late (only just became aware of the issue about 15 minutes ago)
[18:54:36] <RoanKattouw>	 Ha
[18:55:02] <RoanKattouw>	 As a post-mortem I do wonder how beta was broken for days without you being recruited
[18:55:56] <RoanKattouw>	 (I did most of the fixing yesterday, which involved figuring out what caused the problem together with Daniel+Adam, then bringing the master into a consistent state and depooling the replica)
[18:56:02] <greg-g>	 days? you mean 1?
[18:56:26] <RoanKattouw>	 I thought it had been broken since Monday?
[18:56:33] <RoanKattouw>	 Or did I hear that wrong and did it only start yesterday?
[18:56:34] <greg-g>	 oh?
[18:56:44] * greg-g was just looking at the unbreak beta dbs task
[18:57:53] <greg-g>	 yeah, looks like at least tues.
[18:58:58] <greg-g>	 tl;dr: I didn't see it until yesterday afternoon when I caught up with bugmail. Also, today's only Wed (I just remembered) :)
[18:59:12] <RoanKattouw>	 Tuesday is when I heard
[18:59:22] <RoanKattouw>	 But I can ask Elena when it started, she'll know, she uses beta daily
[19:00:16] <greg-g>	 I didn't see any discussion in -releng (a good way of making sure we see something, we all get a lot of bugmail)
[19:01:03] <greg-g>	 there's a complicating factor of "who can work on this part of beta, releng can't possibly know how to do everything" (See also our policy of services maintaince in beta)
[19:01:08] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), 10Patch-For-Review: Unbreak replication in beta cluster - https://phabricator.wikimedia.org/T183252#3852204 (10Catrope) 05Open>03Resolved a:03Catrope I exported the data on db03, imported it on db04...
[19:01:37] <RoanKattouw>	 What/where is this policy?
[19:01:44] <RoanKattouw>	 And yes, fair enough, nobody pinged in -releng
[19:02:03] <greg-g>	 RoanKattouw: do you need it written down to realize 6 people can't keep up with hundreds?
[19:02:20] <greg-g>	 s/hundreds/at least 30/
[19:02:22] <greg-g>	 :)
[19:02:28] <greg-g>	 TechOps + Services
[19:02:54] <RoanKattouw>	 Well sure, that's understandable, but all beta wikis being entirely read-only is another matter
[19:03:06] <RoanKattouw>	 Is there some sort of policy about what is maintianed?
[19:03:11] <greg-g>	 yeah, for a day, which is longer than we'd hope
[19:03:34] <greg-g>	 No. Beta is the red-haired step child no one wants to maintain
[19:04:04] <greg-g>	 but we do because someone has to, even though we have no capacity for it (we have no Opsen to give Opsen level response times to outages, for instance)
[19:04:05] <twentyafterfour>	 would it be possible to snapshot the databases in a working state and then we could just easily restore them from that snapshot when things fail?
[19:04:20] <twentyafterfour>	 I mean, we could take a dump but doesn't openstack have snapshot capability?
[19:04:59] <RoanKattouw>	 Well what happened yesterday was quite strange
[19:05:11] <greg-g>	 tl;dr: yes, I agree the response time here was not great. :)
[19:05:59] <RoanKattouw>	 Strange in that several layers of protection all failed to prevent MW code from writing to a replica DB server
[19:06:10] * greg-g goes into a 1:1
[19:06:19] <RoanKattouw>	 Which resulted in a very messy situation that I don't think anyone knew how to clean up, I had to figure it out as I went
[19:08:30] <twentyafterfour>	 which is apparent from the task, it's an epic one
[19:08:39] <RoanKattouw>	 Well it took about a day
[19:08:55] <RoanKattouw>	 But yeah just figuring it out took quite some time
[19:10:41] <RoanKattouw>	 So there's not much that would have helped, other than ops/DBA having had documentation about how to repair replication
[19:10:46] <RoanKattouw>	 Which I had to piece together from MySQL docs
[19:11:37] <twentyafterfour>	 RoanKattouw: yeah, that part is a bit tricky. We should have some automation / scripts for that part
[19:11:49] <twentyafterfour>	 or a way to restore from a snapshot
[19:11:58] <RoanKattouw>	 It was harder in this case because there's only 1 replica
[19:12:17] <RoanKattouw>	 If we had 2 replicas, then I would have been able to repair the broken one by shutting down the non-broken one and snapshotting it
[19:12:30] <RoanKattouw>	 OTOH this bug would have hosed both replicas in different ways, so..
[20:44:34] <wikibugs>	 10DBA, 10MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017): Determine how to update old compressed ExternalStore entries for T181555 - https://phabricator.wikimedia.org/T183419#3852480 (10Anomie) p:05Triage>03Normal