• 1 Post
  • 6 Comments
Joined 2 years ago
cake
Cake day: July 28th, 2023

help-circle

  • Don’t seem to be any disk reads on request at a glance, though that might just be due to read caching on OS level. There’s a spike on first page refresh/load after dropping the read cache, so that could indicate reading the file in every time there’s a fresh page load. Would have to open the browser with call tracing to be sure, which I’ll probably try out later today.

    For my other devices I use unbound hosted on the router, so this is the first time encountering said issue for me as well.


  • You’re using software to do something it wasn’t designed to do

    As such, Chrome isn’t exactly following the best practices either – if you want to reinvent the wheel at least improve upon the original instead of making it run worse. True, it’s not the intended method of use, but resource-wise it shouldn’t cause issues – at this point one would’ve needed active work to make it run this poorly.

    Why would you even think to do something like this?

    As I said, due to company VPN enforcing their own DNS for intranet resources etc. Technically I could override it with a single rule in configuration, but this would also technically be a breach of guidelines as opposed to the more moderate rules-lawyery approach I attempt here.

    If it was up to me the employer should just add some blocklist to their own forwarder for the benefit of everyone working there…

    But guess I’ll settle for local dnsmasq on the laptop for now. Thanks for the discussion 👌🏼


  • TLDR: looks like you’re right, although Chrome shouldn’t be struggling with that amount of hosts to chug through. This ended up being an interesting rabbit hole.

    My home network already uses unbound with proper blocklist configured, but I can’t use the same setup directly with my work computer as the VPN sets it’s own DNS. I can only override this with a local resolver on the work laptop, and I’d really like to get by with just systemd-resolved instead of having to add dnsmasq or similar for this. None of the other tools I use struggle with this setup, as they use the system IP stack.

    Might well be that chromium has a bit more sophisticated a network stack (than just using the system provided libraries), and I remember the docs indicating something about that being the case. In any way, it’s not like the code is (or should be) paging through the whole file every time there’s a query – either it forwards it to another resolver, or does it locally, but in any case there will be a cache. That cache will then end up being those queried domains in order of access, after which having a long /etc/hosts won’t matter. Worst case scenario after paging in the hosts file initially is 3-5 ms (per query) for comparing through the 100k-700k lines before hitting a wall, and that only needs to happen once regardless of where the actual resolving takes place. At a glance chrome net stack should cache queries into the hosts file as well. So at the very least it doesn’t really make sense for it to struggle for 5-10 seconds on every consecutive refresh of the page with a warm DNS cache in memory…

    …or that’s how it should happen. Your comment inspired me to test it a bit more, and lo: after trying out a hosts file with 10 000 000 bogus entries chrome was brought completely to it’s knees. However, that amount of string comparisons is absolutely nothing in practice – Python with its measly linked lists and slow interpreter manages comparing against every row in 300 ms, a crude C implementation manages it in 23 ms (approx. 2 ms with 1 million rows, both a lot more than what I have appended to the hosts file). So the file being long should have nothing to do with it unless there’s something very wrong with the implementation. Comparing against /etc/hosts should be cheap as it doesn’t support wildcard entires – as such the comparisons are just simple 1:1 check against first matching row. I’ll continue investigating and see if there’s a quick change to be made in how the hosts are read in. Fixing this shouldn’t cause any issues for other use cases from what I see.

    For reference, if you want to check the performance for 10 million comparisons on your own hardware:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    
    
    int main(void) {
    	struct timeval start_t;
    	struct timeval end_t;
    
    	char **strs = malloc(sizeof(char *) * 10000000);
    	for (int i = 0; i < 10000000; i++) {
    		char *urlbuf = malloc(sizeof(char) * 50);
    		sprintf(urlbuf, "%d.bogus.local", i);
    		strs[i] = urlbuf;
    	}
    
    	printf("Checking comparisons through array of 10M strings.\n");
    	gettimeofday(&start_t, NULL);
    
    	for (int i = 0; i < 10000000; i++) {
    		strcmp(strs[i], "test.url.local");
    	}
    
    	gettimeofday(&end_t, NULL);
    
    	long duration = (end_t.tv_usec - start_t.tv_usec) / 1000;
    	printf("Spent %ld ms on the operation.\n", duration);
    
    	for (int i = 0; i < 10000000; i++) {
    		free(strs[i]);
    	}
    	free(strs);
    }
    



  • And we’ve nowadays taken it even further, in spoken Finnish we’ve even got rid of the “hän” and mostly use “se”, which is the Finnish word for “it”. The same pronoun is used for people in all forms, animals, items, institutions and so on, and in practice the only case for “hän” is people trying to remind others they consider their pets human.

    Context will tell which one it is.