[00:00:27] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3739332 (10RobH) They advised we needed to open a support case, so I did so, 956261134. They're following up and will let us know.
[03:19:32] 10Traffic, 10Operations, 10Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#3739681 (10Nuria) This ticket should be a talk, really.
[06:07:52] let's use azure to serve azure's status page, shall we? https://puck.nether.net/pipermail/outages/2017-November/010957.html
[06:59:13] yeee!
[06:59:16] https://logstash.wikimedia.org/app/kibana#/discover/188b07b0-c389-11e7-a44b-9b945870b167?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-3h%2Cmode%3Arelative%2Cto%3Anow))
[07:00:52] 10Traffic, 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Add varnish logs to logstash - https://phabricator.wikimedia.org/T63782#3739791 (10ema) 05Open>03Resolved a:03ema [[https://logstash.wikimedia.org/app/kibana#/discover/188b07b0-c389-11e7-a44b-9b945870b167?_g=(refreshInterval%3A(displa...
[07:17:49] moritzm: pinkunicorn rebooted with 4.9.51 and the latest openssl, looks good.
[07:40:32] ema: yep, latest openssl 1.1 was already live since last Thursday, BTW. there were no differences in ssllabs runs against it
[08:12:55] moritzm: isn't it 1.0.2m?
[08:15:15] nginx uses the 1.1 packages, 1.0.2 is only used by the common low level daemons (sshd, nrpe, diamond, lldpd, prometheus)
[08:41:50] 10netops, 10Operations: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3739933 (10MoritzMuehlenhoff)
[08:42:00] 10netops, 10Operations: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3671479 (10MoritzMuehlenhoff) p:05Triage>03Normal
[09:25:33] right, 1.0.2m wasn't upgraded yet though
[09:30:35] yeah, I'm rolling it out service-by-service and skipped cp* since you mentioned the dist-upgrades
[12:47:25] https://www.theverge.com/2017/11/6/16614160/comcast-xfinity-internet-down-reports
[12:48:18] moritzm: cache_misc rebooted, all good
[12:48:29] nice, thx
[14:10:45] we haven't been running vtc tests as part of the varnish build process! amended https://gerrit.wikimedia.org/r/#/c/389516/ to fix that
[14:18:15] ema: you're building a new 5.1.3, or already did?
[14:18:42] bblack: I've built it locally, yeah
[14:18:49] I was gonna say, I did add VTC to the transaction timeouts patch, it's just missing conversion to debian/patches/
[14:18:57] want to roll it in?
[14:19:04] jenkins is probably currently working hard on the tests now :)
[14:19:07] bblack: yeah sure
[14:19:35] ok
[14:19:46] I'll redo the patch as a debian/patches/ and push back to gerrit again
[14:20:03] also, we can now see varnish backend restarts in kibana:
[14:20:04] https://logstash.wikimedia.org/app/kibana#/discover/b1dbea50-c3c6-11e7-9a7b-5d594b97194d?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-12h%2Cmode%3Arelative%2Cto%3Anow))
[14:20:20] those bars there are the cache_misc reboots
[14:20:56] nice
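A note on the vtc-tests-in-the-build change mentioned at 14:10 above: upstream varnish wires its VTC suite into the normal automake test target, so a package build that wants to exercise it essentially just has to let that step run. A rough way to reproduce it by hand against an unpacked source tree (this is the plain upstream invocation, not necessarily the exact flags debian/rules passes):

    # Run the upstream VTC/varnishtest suite the same general way the package
    # build now does; "make check" is the standard automake entry point.
    ./autogen.sh
    ./configure
    make
    make check   # runs the .vtc test cases via varnishtest; expect this to be slow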
[14:25:56] ema: https://gerrit.wikimedia.org/r/#/c/387236/
[14:27:43] although now that I think about it, I did the built-sources bits according to autoconf/automake/git, not according to debian, maybe
[14:28:33] (as in, if configure->make rebuilt a generated file which was originally tracked in git, I added that output to the patch. I'm not sure if that's right or not under debian packaging, could cause some debian package build problem that requires removing some of the generated files from the patch)
[14:31:14] hmmm and jenkins failed it. looking at the extremely verbose output there
[14:32:39] ah
[14:32:45] lintian output: 14:28:56 E: varnish: bad-provided-package-name varnishabi-strict-NOGIT
[14:32:55] (doesn't like uppercase in package names)
[14:33:25] we might have to suppress that if jenkins is running a package build against un-merged commits that generate weird package names
[14:33:30] where does that NOGIT come from?
[14:34:22] jenkins did build https://gerrit.wikimedia.org/r/#/c/389516/ correctly, so what's the difference between that and the other CR?
[14:35:08] lib/libvcc/generate.py:
[14:35:10] if os.path.isdir(os.path.join(srcroot, ".git")):
[14:35:12] [...]
[14:35:19] ah!
[14:35:22] else: b = "NOGIT" v = "NOGIT"
[14:35:37] that generator is one that I ran for my patch though
[14:35:49] maybe my patch itself ended up encoding the NOGIT part somehow?
[14:36:33] doesn't seem so, though
[14:36:36] was that one of the files generated by configure/make?
[14:36:40] yeah
[14:36:44] well
[14:37:00] generate.py isn't generated by configure/make, but it's run by configure/make to generate other files
[14:37:23] some of which have their generated outputs tracked in our/upstream git, so I included the generated output diffs in the patch as well
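For reference, the fallback being quoted from lib/libvcc/generate.py works roughly like this (paraphrased into shell as a simplified sketch; the real script is Python, and how it derives the branch/commit strings when .git is present is not shown in the chat):

    # Simplified paraphrase of the generate.py version-detection fallback: when
    # the source root has no .git directory -- which is what you get when building
    # from an exported/patched tree -- the branch and version strings fall back to
    # "NOGIT", which is presumably how the generated varnishabi-strict-NOGIT
    # provides name that lintian rejects comes about.
    if [ -d "$srcroot/.git" ]; then
        :   # real script derives branch/version strings from the git metadata here
    else
        b="NOGIT"
        v="NOGIT"
    fi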
[14:37:47] ema: if jenkins times out building the varnish package let me know. The timeout can be bumped
[14:37:50] (which I philosophically disagree with, but there are times it's the most pragmatic option I guess)
[14:38:24] hashar: it took ~ half an hour but managed in the end to build 389516
[14:38:30] ouch
[14:39:33] it's the VTC tests that take forever apparently:
[14:39:35] 14:13:07 PASS: tests/a00000.vtc
[14:39:36] [...]
[14:39:44] ema: gimme a few, I'll go build the deb on copper with my patch cherrypicked over and try to figure out what goes wrong
[14:39:50] 14:37:02 PASS: tests/v00051.vtc
[14:39:53] the VTC take forever always
[14:39:59] ema: most probably they can be made to run in parallel.
[14:40:02] a lot of them have sleeps/delays to test timing stuff
[14:40:21] (and probably can't be parallelized easy, they might stomp on each other's test ports)
[14:40:50] (but, I guess I've never tried to parallelize it, either)
[14:42:10] bblack: s/copper/boron/ :)
[14:42:17] lol
[14:42:25] https://gerrit.wikimedia.org/r/389717 Set operations/debs/varnish4 timeout to 60 minutes
[14:42:32] my poor command history from copper, I'll be lost without it! :)
[14:42:40] (this way I can forget about that job :D )
[14:42:51] hashar: :)
[14:43:40] hashar: it would be absolutely fantastic to have jenkins comment on the CR that it has started building the project, possibly with a link to the build logs
[14:44:09] would that be doable?
[14:44:15] we can make it comment for sure
[14:44:27] but that would end up being quite spammy in Gerrit if we do that for all jobs/repos
[14:44:39] yeah
[14:44:53] and when it enqueues the change, it doesn't know the jenkins job url yet until the job starts
[14:45:06] maybe the feature could be enabled only for long running builds?
[14:45:08] but you can find it on https://integration.wikimedia.org/zuul/
[14:45:31] yeah but I always get lost there!
[14:45:43] and I think we had the idea of adding some javascript in Gerrit that would dynamically fetch from that page to display the job for the change being looked at
[14:46:09] when the zuul page is busy, you can use the Filters: [_________] field
[14:46:14] eg Filters: [varnish_____]
[14:46:20] I'm testing the effects of simple parallel make on the testsuite now
[14:46:24] BTW on my workstation the whole build takes 4 minutes
[14:46:41] :(
[14:46:43] instead of 27
[14:47:23] ema: maybe your make is automatically parallelizing the testsuite due to some difference of defaults or env vars?
[14:48:06] my testing so far is saying that parallel make is making a big difference and does work
[14:48:07] bblack: I do see -j8 in my build logs
[14:48:35] at -j8 on my laptop, my "time" timings are:
[14:48:36] real 3m6.728s
[14:48:37] user 1m27.328s
[14:48:37] sys 0m11.448s
[14:48:41] and at -j20 it's:
[14:48:47] real 1m30.768s
[14:48:47] user 1m49.944s
[14:48:47] sys 0m13.868s
[14:49:10] so basically, there's so much sleeping going on, that not until -j20 did it manage to really on-average use slightly more than a CPU core
[14:49:13] I think the instances only have 2 cpus though
[14:49:19] hashar: see above
[14:49:29] but yeah if that is mostly due to sleep.. :]
[14:49:32] \o:
[14:49:46] my laptop has 2 cpu cores too, and probably slower than CI :)
[14:50:13] anyways, it will probably vary by environment a bit
[14:50:31] but -j8 is "use half a cpu or so", -j20 is "use slightly more than 1 cpu"
[14:50:40] (for varnishd VTCs anyways)
[14:59:15] hashar: can we give a try to -j and see what happens? :)
[15:15:57] ema: I have no clue how the debian packaging toolchain lets one set -j
[15:23:27] probably via env vars
[15:24:14] e.g. MAKEFLAGS=-j16 or something?
[15:25:20] I'm trying that now on boron, to see if MAKEFLAGS at outer scope survives down into the pbuilder build and affects things
[15:25:45] as always, there's like 32 million ways to do that
[15:26:23] https://nthykier.wordpress.com/2016/09/11/debhelper-10-is-now-available/ seems debhelper 10 invokes dh with --parallel by default
[15:26:33] you know your packaging system is often when it shares attributes with legacy Perl :)
[15:26:36] https://en.wikipedia.org/wiki/There%27s_more_than_one_way_to_do_it
[15:26:44] gotta bump debian/compat to 10 and debhelper >=10 in the build-deps
[15:26:49] hashar: we do explicitly call dh with --parallel
[15:26:57] then I am not sure dh --parallel actually causes make to -j something
[15:26:59] ah
[15:27:20] hmmm
[15:27:36] maybe something else in the CI environment is setting some make parallelism limit?
[15:28:45] a way to do it is DEBBUILDOPTS="-j12" in /etc/pbuilderrc
[15:29:15] so -j12 will be passed to dpkg-buildpackage
[15:29:32] doesn't dh --parallel already do "something"?
[15:29:43] maybe it checks /proc/cpuinfo and sets -j=NCPUS?
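As far as I understand standard debhelper behaviour (this describes the general mechanism, not our specific CI setup): dh --parallel does not pick a -j value on its own from /proc/cpuinfo, it only enables parallelism when DEB_BUILD_OPTIONS carries parallel=N, which dh_auto_build/dh_auto_test then translate into make -jN. So the knob lives in the build's environment, along these lines:

    # Local one-off: ask a debhelper-driven build for 12-way make parallelism.
    # (dpkg-buildpackage's -jN option is another way to end up with this setting.)
    export DEB_BUILD_OPTIONS="parallel=12"
    dpkg-buildpackage -us -uc

    # Timing the testsuite by hand, as in the -j8/-j20 comparison above
    # (numbers will obviously vary per machine):
    time make -j20 check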
[15:30:32] maybe we need a DH_MAKE_PARALLEL_CPU_MULTIPLIER patch :)
[15:31:19] maybe we can try: DEB_BUILD_OPTION="parallel=12"
[15:32:09] hashar: DEB_BUILD_OPTION*S*
[15:32:31] which of course differs from pbuilder's DEBBUILDOPTS in fundamental ways
[15:33:12] every time I stare at {git-,}pbuilder, pdebuild, debuild, dpkg-buildpackage, I come to the conclusion that the world is a terrible place
[15:33:24] +1
[15:33:54] it's very reminiscent of the autotools mess
[15:34:20] someone built an over-complex tool with some design flaws to help accomplish a complex thing slightly better
[15:34:25] now I've tried setting DEBBUILDOPTS="-j12" in my pbuilderrc and this is what happened:
[15:34:33] and someone built another over-complex tool with some design flaws as a new layer on top of that one
[15:34:35] pbuilder build --debbuildopts --debbuildopts '-j12 '-j8' '-us' '-uc' [...]
[15:34:48] recurse several times until human frustration limits kick in and prevent further layer growth
[15:35:25] have you seen what happened there with --debbuildopts? Why can't we have nice things?
[15:36:00] https://gerrit.wikimedia.org/r/389725 Build operations/debs/varnish4 with parallel=12
[15:36:07] I am gonna deploy that and see what happens :D
[15:37:25] maybe I should write a new tool called autoeverything which has a new DSL that's turing-complete and is compiled to Lua code which generates configure.ac+Makefile.am files for a project :P
[15:37:47] hmm
[15:38:03] don't forget a Dockerfile and some kind of rest api
[15:40:18] and a debian package which creates a k8s cluster on install to use for automatic integration testing of the rest of the stack
[15:41:11] and a README for that which says to use: curl https://github.com/foo/bar | sh to install the debian package and do the extra setup steps
[15:42:26] sounds like VC startup material
[15:43:23] (the secret VC pitch slide deck talks about the revenue from selling access to the DDoS botnet the tools create)
[15:45:28] FWIW on boron, I took the current varnish4 repo + cherrypick of my patch and did:
[15:45:31] DIST=jessie ARCH=amd64 WIKIMEDIA=yes gbp buildpackage --git-pbuilder --git-debian-branch=debian-wmf --git-upstream-branch=upstream --git-upstream-tree=branch --git-export-dir=../build-area --git-dist=jessie -us -uc
[15:45:41] CI updated
[15:45:42] and the build succeeded (no NOGIT issues)
[15:46:39] now I'm trying the same with my patch + ema's wm2 patch that runs tests (going much slower)
[15:47:15] https://integration.wikimedia.org/ci/job/debian-glue/830/parameters/ has DEB_BUILD_OPTIONS=parallel=12
[15:47:37] does the CI debian-glue stuff actually run the dpkg stuff from a git checkout, or is it very different?
[15:48:15] bblack: define dpkg stuff? :D
[15:48:17] no really
[15:48:20] it's just a few wrappers
[15:48:24] 14:27:36 make[5]: Entering directory '/tmp/buildd/varnish-5.1.3/bin/varnishd'
[15:48:28] that git clone the repo / checkout branches appropriately
[15:48:32] then invoke gbp buildpackage
[15:48:37] and rely on cowbuilder
[15:48:59] with Alexandros modules/package_builder apt hook being injected somehow via .pbuilderrc (iirc)
[15:49:02] https://integration.wikimedia.org/ci/job/debian-glue/830/console looks faster
[15:50:00] bblack: yeah that /tmp/buildd is in the cowbuilder chroot
[15:50:19] and cowbuilder really invokes pbuilder under the hood
[15:50:37] ema: down to 4 minutes ( 00:04:05.494 Finished: SUCCESS )
[15:50:44] you guys are magic gurus really
[15:50:56] solved by passing DEB_BUILD_OPTIONS=parallel=12
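Putting the afternoon's pieces together, a rough local equivalent of what the debian-glue job now does. This is an approximation assembled from the chat, not the job's actual code: the clone URL is guessed from the repo name, and it assumes DEB_BUILD_OPTIONS survives into the cowbuilder chroot the same way it does for the CI job. The gbp invocation itself is the one quoted at 15:45 above, plus the parallel=12 setting that fixed the half-hour build times.

    # Clone and build operations/debs/varnish4 roughly the way CI does.
    git clone https://gerrit.wikimedia.org/r/operations/debs/varnish4
    cd varnish4
    DEB_BUILD_OPTIONS="parallel=12" \
    DIST=jessie ARCH=amd64 WIKIMEDIA=yes \
        gbp buildpackage --git-pbuilder \
            --git-debian-branch=debian-wmf --git-upstream-branch=upstream \
            --git-upstream-tree=branch --git-export-dir=../build-area \
            --git-dist=jessie -us -uc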
[15:54:29] hashar: yay! thanks!
[15:59:13] https://upload.wikimedia.org/wikipedia/commons/a/a6/Population_density_countries_2017_world_map%2C_people_per_sq_km.svg
[15:59:25] ^ kind of an interesting data perspective when thinking about future cache edges
[15:59:52] it's a little different than internet population maps, but maybe we should think more about people and less about how-well-connected they are, I donno
[16:02:44] internet pop/penetration: https://i.kinja-img.com/gawker-media/image/upload/192srcean0ybhpng.png
[16:02:48] that calls for a Monaco edge pop :-)
[16:03:54] that defends esams at least ;)
[16:05:37] bblack: ok to start text/upload kernel/openssl upgrades and reboots?
[16:06:01] https://thumbnails-visually.netdna-ssl.com/GlobalInternetInfrastructure_52b1860649c71_w1500.jpg
[16:06:04] ema: yes
[16:16:14] 10Traffic, 10Operations, 10ops-ulsfo: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3741664 (10RobH)
[16:27:49] ema: there's 2x uploads in ulsfo w/ heavy mbox lag btw, might be problematic. I donno how soon they are in your reboots though...
[16:29:58] bblack: perhaps I should reboot them sooner rather than later? They're pretty far down the list at the moment
[16:30:22] that or we can just restart varnishd now and let it reboot whenever later
[16:30:26] whatever's easier
[16:31:02] easier to reboot them first, I'll go ahead
[16:31:33] 10Traffic, 10Operations, 10ops-esams: cp4043 disk failure - https://phabricator.wikimedia.org/T179953#3741749 (10RobH)
[16:31:53] 10Traffic, 10Operations, 10ops-esams: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3741763 (10BBlack)
[16:45:35] bblack: do you want to amend my 5.1.3 patch with your changes or merge yours separately?
[16:53:32] ema: whatever works best for you really.
[16:53:54] we have a conflict on patches/series and the 0006 number anyways heh
[16:58:33] ok let's go for two separate patches then :)
[17:21:32] 10Traffic, 10Operations, 10ops-esams: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3741889 (10RobH) I have requested parts dispatch SR956320029. Once they notify me of shipment, I'll open an inbound shipment request with EvoSwitch, as well as a smart hands ticket for them to swap the SSD...
[17:21:39] 10Traffic, 10Operations, 10ops-esams: cp3043 disk failure - https://phabricator.wikimedia.org/T179953#3741890 (10RobH) a:05mark>03RobH
[17:40:25] !log stop cache_text/upload rolling reboots, resuming tomorrow
[17:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:39] o/
[17:43:40] cya :)
[18:03:26] 10netops, 10Operations: Allow syslog-tls in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3742155 (10fgiunchedi) While investigating with @ayounsi it emerged that regular syslog is also blocked from analytics to prod. Please allow udp/514 as well, thanks!
[18:04:14] 10netops, 10Operations: Allow syslog-tls and syslog in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3742159 (10fgiunchedi)
[18:10:40] 10netops, 10Operations: Allow syslog-tls and syslog in analytics towards wezen/lithium - https://phabricator.wikimedia.org/T177821#3742177 (10ayounsi) 05Open>03Resolved a:03ayounsi Done.
[20:38:26] 10Traffic, 10Operations, 10Phabricator, 10Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#3742647 (10mmodell) 05Open>03Resolved a:03BBlack YAY! it only took 2.16 years!
[20:44:19] 10Traffic, 10Operations, 10Phabricator, 10Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#3742669 (10mmodell) @bblack spent a bunch of time debugging issues with websockets + varnish, so thanks a lot for your time and expertise, Bran...
[21:17:35] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3742811 (10RobH) I've been in discussion with Renny@Dell support. We lowered the CPU count from all to just 2 per CPU. The error still happened during the OS install and boot just no...
[22:17:22] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3742977 (10mmodell)
[22:17:49] 10Traffic, 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) https://gerrit.wikimedia.org/r/#/c/389871/
[23:30:36] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3743155 (10RobH) Dell seems to agree, they are dispatching a replacement mainboard. I'll swap and we'll see what happens.