[05:17:08] 10DBA, 10Epic, 10Patch-For-Review, 10codfw-rollout: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099#4128987 (10Marostegui)
[05:17:12] 10DBA, 10MediaWiki-API: Database query error (internal_api_error_DBQueryError) while getting list=allrevisions - https://phabricator.wikimedia.org/T123557#4128988 (10Marostegui)
[05:17:19] 10DBA, 10Patch-For-Review: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416#4128985 (10Marostegui) 05Open>03Resolved db1066 is now fixed ``` root@neodymium:~# mysql -hdb1066.eqiad.wmnet enwiki -e "show create table revision\G" *************...
[06:43:39] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129039 (10Marostegui) Some more food for thought. The errors happen _exactly_ every 10 minutes, almost to the second. Bursts after depooling it from main (according to logstash): 06:20:10 until 06:20:13 06:30:11 un...
[07:27:54] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129117 (10Marostegui) So during the errors, normally around 5 seconds or so, there is a burst in connections, which almost doubles the normal amount of connections. Examples Time and amount of hits on tcpdump to po...
[07:38:09] did you get the full trace of those^?
[07:54:45] yes
[07:54:49] I am going through them
[07:56:18] it is amazing how it is 10,20,30… and the second between 10 and 13
[07:57:47] yes, I said it was bursty
[07:57:54] but even to the second?
[07:58:06] to the second probably means internal
[07:58:08] it cannot be normal traffic
[07:58:11] exactly
[07:58:16] cronjob for wikidata?
[07:58:28] so far I am only seeing mw hosts
[07:58:32] connecting to it
[07:58:54] yes, but we need to know the queries
[07:59:01] http and sql
[07:59:12] although if they are failing to connect, no query yet
[07:59:18] exactly
[07:59:22] there are no queries during those seconds
[07:59:26] just connections
[08:03:29] So during the seconds it lasts I see connections from only mw hosts, db1052 and db1115 (tendril)
[08:03:38] nothing like terbium or stuff like that
[08:04:15] that is strange, I assume tendril is not a large contributor to it
[08:04:30] otherwise there could be serious issues there with connections
[08:08:17] I am trying to think what we could disable to try to isolate traffic or something
[08:12:20] maybe we can involve arzhel to see if he can see something at switch level or something
[08:13:57] I would put the server offline and do some tests
[08:14:05] offline == depool
[08:14:28] but if we depool it from API only, errors are gone
[08:14:58] that is the weird thing, that whatever it is, it is not moving to another API server
[08:16:21] that is why I want to try to break it or do some tests
[08:16:54] maybe it gets overloaded and the query killer kills stuff, etc.
[08:19:11] let's depool it from everywhere, and see if there is anything arriving to it at those times XX:20:10, XX:30:10 etc
[08:19:23] just in case something is hardcoded or internal or something
[08:20:55] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129203 (10Marostegui) From the captures, I only see "normal" traffic as in: mw hosts, db1052 (the master) and db1115 (tendril).
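For the per-second counting of tcpdump hits mentioned in the 07:27:54 comment, something along these lines works. It is only a sketch: the capture command it expects (`tcpdump -tt -nn 'tcp dst port 3306'`) and the port are assumptions, the log only says that hits were counted per second.

```
#!/usr/bin/env python3
"""Bucket tcpdump hits per second to make connection bursts visible.

Sketch only: assumes `tcpdump -tt -nn 'tcp dst port 3306'` output piped
on stdin (epoch timestamp as the first field of each line); the port is
an assumption, not taken from the log.
"""
import sys
from collections import Counter

hits_per_second = Counter()
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    try:
        # `tcpdump -tt` prints the epoch timestamp as the first field
        second = int(float(fields[0]))
    except ValueError:
        continue  # skip continuation/hexdump lines
    hits_per_second[second] += 1

# Show the busiest seconds so bursts like XX:20:10-XX:20:13 stand out.
for second, count in hits_per_second.most_common(20):
    print(second, count)
```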
[08:36:43] I am going to run some tests, tendril on db1114 will fail
[08:36:50] for a few minutes
[09:14:58] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129264 (10Marostegui) We can now discard tendril for sure as a cause of this (it was hard to believe it was it anyways, but better to confirm it). I used iptables to drop all the traffic coming from tendril DB (db1...
[10:11:35] I am going to take a break for an early lunch while I wait for the es1013 buffer pool to warm up
[10:12:03] enjoy!
[10:19:36] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Wikidata-Ministry-Of-Magic-Tech-Debt, and 3 others: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279#4129310 (10WMDE-leszek)
[10:21:15] 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 10Patch-For-Review, and 2 others: Investigate optimzing wb_terms - https://phabricator.wikimedia.org/T188279#4002167 (10WMDE-leszek)
[10:40:29] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129334 (10Marostegui) >>! In T191996#4126282, @Marostegui wrote: > This host has dropped around 300 packets in 15h or so. > Yesterday I checked the amount of drops in its interface and it was 1815, today it is 2103...
[10:56:52] I cannot pool es1013 yet, its cache is still at -243% efficiency
[10:57:15] sorry, that was a mistake
[10:57:21] I meant -248% efficiency
[10:58:04] I think that means that for every request made, 3-4 are found not in cache
[11:15:15] jynus: during my lunch break I went ahead and added tox and fixed up stuff for wmfmariadbpy : https://gerrit.wikimedia.org/r/#/c/426004/ :]
[11:20:04] thanks for that!
[11:21:02] that is extremely helpful
[11:23:06] check_health and compare were not really production ready, but you already fixed them, too!
[11:24:38] hashar: do you have any direction for how to do test that require heavy backend (sysop) setup?
[11:24:46] *tests
[11:25:14] e.g. set up a mariadb server with a fake copy of enwiki data
[11:26:07] * fully automatic tests (obviously, setup can be done manually)
[12:14:10] jynus: I have no clue what the software is doing. But potentially the integration tests could take care of spinning up a mariadb instance and populating it with test data
[12:14:48] to run against a real backend with real data, I guess that can be done manually
[12:15:02] (sorry I went out to bring kids to school and walk a bit)
[12:15:32] I am not sure that is really feasible - we are talking 1-2 hours to set up the database
[12:15:59] could that be a static resource à la beta
[12:16:23] so only the frontend is set up each time?
[12:17:02] I am basically a bit lost, and if you have any idea, not something high priority right now
[12:18:44] (you don't have to know, just in case you said "oh, we do this like that for the similar thing Y")
[12:31:12] jynus: surely we could have CI point to a database hosted on labs, a bit like the toollabs replica maybe
[12:31:31] or just manually run it against an existing db
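On the es1013 warm-up above (10:56 to 10:58): the exact formula behind the quoted -248% "efficiency" is not in the log, so the sketch below sticks to the standard InnoDB counters and prints a plain buffer pool hit rate (1 - disk reads / logical read requests) as a way to watch the pool warm up before repooling. The host name is taken from the conversation; everything else is an assumption.

```
#!/usr/bin/env python3
"""Rough buffer pool warm-up check before repooling a host.

Sketch only: uses the mysql client like the commands quoted earlier in
the log, and the standard SHOW GLOBAL STATUS counters; the "efficiency"
metric from the graphs is not reproduced here because its formula is
unknown.
"""
import subprocess

HOST = "es1013.eqiad.wmnet"  # the host being warmed up in the conversation


def global_status(host, like):
    """Return SHOW GLOBAL STATUS rows matching `like` as a dict."""
    out = subprocess.check_output(
        ["mysql", "-h", host, "-BN", "-e",
         "SHOW GLOBAL STATUS LIKE '{}'".format(like)],
        universal_newlines=True)
    return {name: int(value)
            for name, value in (line.split("\t") for line in out.splitlines())}


status = global_status(HOST, "Innodb_buffer_pool_read%")
disk_reads = status["Innodb_buffer_pool_reads"]        # reads that had to hit disk
requests = status["Innodb_buffer_pool_read_requests"]  # all logical read requests
hit_rate = 1 - disk_reads / requests if requests else 0.0
print("buffer pool hit rate: {:.1%}".format(hit_rate))
```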
[12:31:55] for CI, possibly one could use a generated set of data. Maybe there is no need to have millions of rows in a test database
[12:32:26] yes, unit/validation tests are ok for CI
[12:32:44] installing mysql and pregenerating <1MB of data is very fast
[12:33:08] which would cover a few functionalities already
[12:33:16] but certain tools there will have to take care of complex topology changes
[12:33:29] and data provisioning
[12:33:31] I should be able to craft a container that has mariadb.deb shipped, then the suite could spawn a mariadb server
[12:33:53] but for the heavy testing, most probably you want to run it manually against a controlled environment
[12:34:34] after all, these are sysop tools, not web tools, which means they have higher dependencies
[12:34:42] not higher
[12:34:44] deeper
[12:36:01] ideally we would even have a job that does not pip install at all and uses .deb packages instead :D
[12:36:33] oh, I agree with that
[12:36:37] that can be done "easily" by using a debian package
[12:36:46] and have the debian building toolchain run the tests in a chroot
[12:37:07] which would have no network access and all python modules installed based on the Depends: field in debian/control
[12:37:24] so e.g. one can do the development with random pip installed dependencies
[12:37:39] and when the packaging work happens, the tests get to run in a more controlled environment that matches what is on prod
[12:37:44] (well, more or less)
[12:37:50] yes
[12:38:42] I guess there could be 3 levels, CI, staging (production size) and production?
[12:39:14] only full releases go to staging
[12:39:31] and only staging-approved ones go to production
[12:39:40] (manually)
[12:41:28] in the SSD pipeline project (which basically overhauls everything) that is more or less the idea
[12:41:52] the software will run in more or less the same environment either locally, on CI, in staging and in production
[12:42:34] I guess once a version is polished and works fine in CI, a release candidate is cut
[12:42:46] SSD for us is disks, what does that really mean for you?
[12:42:47] then the release candidate is tested on staging with a real dataset
[12:42:56] and once polished up, a final can be cut and moved to prod
[12:43:08] Streamlined Software Deployment iirc
[12:43:16] oh, that is another question, but more general
[12:43:18] or maybe the project is now named "deployment pipeline"
[12:43:24] should staging have real data or not?
[12:43:47] but the idea is that eventually one sends a patch to gerrit, votes +2 and stuff gets tested and ultimately deployed to prod automatically
[12:43:48] it wasn't clear in my department
[12:44:04] for staging I don't know. That depends on what we want to test
[12:44:05] if testing production-sized databases
[12:44:21] should be on a breakable replica or one with fake data
[12:44:22] most probably it is easier to just have a snapshot that gets refreshed from time to time
[12:44:26] with private data stripped
[12:44:51] or yes, maybe fake data is good enough. Then it might be challenging to generate good fake data
[12:44:58] well, that is the most difficult of the 3
[12:45:02] while a curated snapshot is probably easier to set up
[12:45:14] (real data with everything private stripped)
[12:46:01] we should talk more about aims for that in the future
[12:46:22] I guess we can use the ops mailing list for that
[12:46:38] (and people not interested can just mute the thread)
[12:48:23] wmfmariadbpy now has tox running in CI: https://integration.wikimedia.org/ci/job/tox-docker/1622/ :]
[12:49:55] thank you again!
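Going back to the 12:32 point that installing mysql and pregenerating <1MB of data is very fast: a minimal sketch of what such a pregenerated fixture could look like. The database and table below are made up for illustration (they are not the real enwiki schema); a real fixture would mirror whichever slice of the production schema the tools under test actually touch.

```
#!/usr/bin/env python3
"""Emit a tiny, reproducible SQL fixture (well under 1MB) for CI-level tests.

Sketch only: `testwiki` and `revision_sample` are hypothetical names,
chosen just to show the sizes involved, not the real production schema.
"""
import random

random.seed(42)  # same fixture on every run

statements = [
    "CREATE DATABASE IF NOT EXISTS testwiki;",
    "USE testwiki;",
    "CREATE TABLE revision_sample ("
    " rev_id INT UNSIGNED PRIMARY KEY,"
    " rev_page INT UNSIGNED NOT NULL,"
    " rev_timestamp BINARY(14) NOT NULL);",
]
# A couple of thousand rows is plenty for unit/validation level checks.
for rev_id in range(1, 2001):
    statements.append(
        "INSERT INTO revision_sample VALUES ({}, {}, '20180401{:06d}');".format(
            rev_id, random.randint(1, 50), rev_id))
sql = "\n".join(statements) + "\n"
assert len(sql.encode()) < 1024 * 1024  # stays far below the 1MB mentioned above

with open("fixture.sql", "w") as fixture:
    fixture.write(sql)
print("wrote fixture.sql ({} bytes)".format(len(sql)))
```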
[12:50:06] jynus: and if you ever want to spawn a mariadb database with the current user and a file socket to write to: https://github.com/wikimedia/integration-quibble/blob/master/quibble/backend.py#L138-L159
[12:50:32] (which really should only take 6 seconds for you to figure out, but took me a good chunk of time to get right)
[12:50:38] oh, I didn't know that existed
[12:50:58] which db does it create, etc.?
[12:51:01] that is to spawn a mariadb instance in the background and then have mediawiki installed using it
[12:51:09] (version?)
[12:51:14] so we can have multiple mariadb instances spawned when multiple jobs run on the same instance
[12:51:36] then I create the database, GRANT stuff https://github.com/wikimedia/integration-quibble/blob/master/quibble/backend.py#L121-L136
[12:51:57] and finally can run the mediawiki installer against that using something like php maintenance/install.php --dbpath=/tmp/mariadb.socket
[12:51:59] \o/
[12:52:28] so then I can run multiple mediawiki tests in parallel on the same host, and each has its own little database!
[12:53:03] this may actually also be helpful to that project in the future
[12:53:14] so for wmfmariadbpy integration tests, potentially the integration tests could try to set up a mariadb instance for themselves
[12:53:26] it could even be a fresh one for each test (to be set up in setUp() and killed in tearDown())
[12:53:27] the idea is to abstract common db tasks
[12:53:54] and then get an option to point to an existing database using an environment variable like: DATABASE=foo
[12:54:13] if DATABASE is set, setUp() would not spawn a fresh instance but use whatever has been indicated via the env variable
[12:54:28] so on CI, with no database, the integration test suite would spawn the db and run tests against it
[12:54:45] locally you would be able to run the integration tests against an existing instance, simply by setting an env variable
[12:55:02] that is all in theory obviously
[12:55:26] feel free to reuse the code at https://github.com/wikimedia/integration-quibble/blob/master/quibble/backend.py
[12:55:42] though there is a bit of boilerplate to make it act as a context manager. So I can do something like:
[12:55:46] with MySQL():
[12:55:49]     php install.php
[12:55:52]     php phpunit.php
[12:55:57] print("done")
[12:56:19] (and mysql gets killed when the context is exited, probably before "done" gets printed)
[13:55:35] es hosts seem quite happy with 10.1, so I may reimage codfw masters next week
[14:27:30] 10DBA, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129748 (10Marostegui) After changing the port configuration and for the record, this is what the interface is showing ``` RX errors 0 dropped 2487 overruns 0 frame 0 ``` We'll see what happens once the server...
[14:28:58] 10DBA, 10Operations, 10netops: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129750 (10jcrespo) Adding the tag to reflect work done at the network layer.
[15:08:05] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129814 (10ayounsi) ```name=db1114 ethtool eno1 Supported pause frame use: No Advertised pause frame use: Symmetric Link partner advertised pause frame use: No ``` ```name=db1114's switch...
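For reference on the pattern hashar describes above (12:50 to 12:56): a rough sketch of a throwaway-instance context manager plus the DATABASE environment variable idea. It is not copied from the quibble backend linked above, and it assumes a MariaDB installation that ships `mysql_install_db` and `mysqld` on the PATH; the class and variable names are made up.

```
#!/usr/bin/env python3
"""Throwaway MariaDB instance as a context manager, plus the DATABASE override.

Sketch only, modelled loosely on the quibble backend linked above (not
copied from it). Assumes a MariaDB package providing `mysql_install_db`
and `mysqld` on PATH; uses a unix socket so parallel runs do not fight
over TCP ports.
"""
import os
import shutil
import subprocess
import tempfile
import time


class ThrowawayMariaDB(object):
    """Spawn mysqld on a private datadir/socket, kill it when the context exits."""

    def __enter__(self):
        self.datadir = tempfile.mkdtemp(prefix="mariadb-test-")
        self.socket = os.path.join(self.datadir, "mysqld.sock")
        subprocess.check_call(
            ["mysql_install_db", "--datadir=" + self.datadir],
            stdout=subprocess.DEVNULL)
        self.proc = subprocess.Popen(
            ["mysqld", "--datadir=" + self.datadir, "--socket=" + self.socket,
             "--skip-networking", "--skip-grant-tables"])
        # Crude readiness wait: a robust version would also check that the
        # process is still alive and give up after a timeout.
        for _ in range(120):
            if os.path.exists(self.socket):
                break
            time.sleep(0.5)
        return self

    def __exit__(self, *exc):
        self.proc.terminate()
        self.proc.wait()
        shutil.rmtree(self.datadir)
        return False


if __name__ == "__main__":
    # The env-var trick from the conversation: reuse an existing instance
    # when DATABASE is set, otherwise spawn a fresh one for this run.
    existing = os.environ.get("DATABASE")
    if existing:
        print("running tests against existing instance:", existing)
    else:
        with ThrowawayMariaDB() as db:
            print("running tests against temporary socket:", db.socket)
```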
[15:09:22] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129816 (10Marostegui) @ayounsi thanks for your help. If you want to compare it with the other two servers that receive exactly the same traffic, those are: db1066 and db1080.
[16:26:42] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130018 (10ayounsi) 1/ Flow-control not helping, reverted 2/ Are the other servers seeing the same bursts of inbound sessions? 3/ The `ifconfig` input drop counter matches the nic stats...
[16:27:07] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130019 (10Marostegui) So given that db1066 and db1080 have the same traffic as db1114 (and even more when db1114 gets depooled from API) and they don't suffer any kind of issues, could...
[16:34:14] 10DBA, 10Operations, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130039 (10Marostegui) >>! In T191996#4130018, @ayounsi wrote: > 1/ Flow-control not helping, reverted > Cool > 2/ Are the other servers seeing the same bursts of inbound sessions? Th...
[16:43:04] 10DBA, 10Operations, 10netops, 10ops-eqiad, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130046 (10Marostegui)
[22:10:30] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4130987 (10Bstorm) So should `content.content_address` be NULL in replica views? Just trying to clarify how the comments her...
[23:30:03] 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446#4131129 (10daniel) content.content_address doesn't have to be nulled. There will just be no mechanism on labs for resolving t...
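Lastly, on the dropped-packet counters being tracked in T191996 (1815, then 2103, then 2487): a small sketch for watching the counter in near real time, to line drops up against the XX:X0:10 connection bursts. It assumes a Linux sysfs layout; eno1 is the interface named in the ethtool output, and the polling interval is arbitrary.

```
#!/usr/bin/env python3
"""Watch a NIC's rx_dropped counter to correlate drops with the connection bursts.

Sketch only: assumes the standard Linux sysfs statistics path and the
eno1 interface named in the ethtool output above.
"""
import time

STAT = "/sys/class/net/eno1/statistics/rx_dropped"


def read_drops():
    with open(STAT) as stat:
        return int(stat.read().strip())


previous = read_drops()
print(time.strftime("%H:%M:%S"), "starting at", previous, "drops")
while True:
    time.sleep(5)
    current = read_drops()
    if current != previous:
        print(time.strftime("%H:%M:%S"), "rx_dropped", previous, "->", current)
        previous = current
```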