31st May 2019

10:08am: This blog
I'm writing this blog in English because I type twice as fast this way as in Russian (yes, I am improving...), and 90% of my friends can read English anyway. I answer comments in the language they are posted in.

I am trying not to post anything related to my employer, but if I do so, this expresses only my own views and does not represent an official position of my employer.

When I post about a technical topic which seems nontrivial and is related to my employer's product, don't expect it to be inside information. If I post about it, it means that this info is already public. Usually I do not post any personal information or anything that is related to my family.

All photos are mine, and I allow anyone to copy, change, do anything you please with them. I don't post "friends only".

Useful tags are: software, idea, Deutschland, readlog, freediving and trip report

16th May 2018

10:10am: Optimizing a kernel with low AI
(Where kernel has nothing to do with operating systems, and AI has nothing to do with artificial intelligence)
I got a compute kernel from a customer that was not performing up to expectations. It turned out that the AI (arithmetic intensity, as in the roofline model) of the kernel was very low, so performance should be limited by memory throughput at a given data locality (the floating point operations in AVX512 were so simple that getting data into registers and writing it back to memory was the slowest part).

For such workloads, a natural performance metric is GB/sec at a given data size. So in theory, when all data fits in L1, it should run at ~350GB/sec, 130GB/sec from L2, 19GB/sec from L3, and ~14GB/sec from RAM (on the Skylake server I am running on).
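As a rough illustration of the roofline bound mentioned above, here is a minimal sketch. The function name and the peak figure used in the note below are my own assumptions for illustration, not measurements from this kernel.

```c
#include <assert.h>

/* Roofline bound: attainable GFLOP/s for a kernel with arithmetic
 * intensity `ai` (FLOPs per byte moved), given the compute peak and the
 * bandwidth of the memory level the working set lives in.
 * The name and signature are hypothetical, for illustration only. */
double roofline_gflops(double ai, double peak_gflops, double bw_gb_per_s)
{
    assert(ai >= 0.0 && peak_gflops > 0.0 && bw_gb_per_s > 0.0);
    double memory_bound = ai * bw_gb_per_s;   /* GFLOP/s the memory can feed */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For example, with an assumed intensity of 0.125 FLOP/byte and the ~14GB/sec RAM figure, roofline_gflops(0.125, 1000.0, 14.0) gives 1.75 GFLOP/s, far below any plausible compute peak, which is why GB/sec is the more natural metric here.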

Here is what I measured when I added a second stream of processing to the compute loop:

It seems that running 2x streams of processing in a single loop improves it a lot, but only when most of the data is in L3. I am not sure why at L3 and 2x streams it runs faster than L3 peak throughput ...

There are many factors: frequency goes down with AVX512, h/w prefetchers, TLB, etc.

14th May 2018

8:16pm: Very useful tool for x86 assembly intrinsics
IntrinsicsGuide lets you search by intrinsic name or instruction name. It covers all SSE/AVX instructions, including some that are not yet released, and also scalar supplementary instructions such as prefetches, scalar bit manipulation, etc.

When looking for a specific instruction, I always use this tool rather than the x86 architecture guide.

12th May 2018

5:10pm: Jacob and transhumanism
Jacob, 3y10m in a car:
- Dad, do all animals die of old age?
- All of them.
- Then you and mom will also get old and die.
- Certainly, but not very soon.
- And me too?
- And you.
- But Frau N won't die, she is the head of the kindergarten! Can something be done so that we don't die either?
- Not yet, but scientists are working on it. Maybe they will find out how to switch off the aging gene.
Next day:
- Dad, I want to become a scientist and find Gena (he heard the Russian word for "gene" as the name Gena), so that we don't die! To do it for you and mom, and for me.
- And for everyone else?
- No, let them all die, and the three of us will stay!

2nd May 2018

9:30am: A dream about Zuckerberg
Today I had a weird dream, again.

I was watching a live feed from F8'2019, and here is what I saw:

First, Mark, dressed in his usual t-shirt, enters the stage and says he has an announcement. The announcement is that the rumors about humanoid robots Facebook is developing are only partially true. They are not robots, they are avatars. 8 people who look like Mark's identical twins enter the stage. One of them is a woman, one wears Mark's t-shirt, the other 6 wear different smart casual outfits. They sit around the desk, and Mark continues: Facebook knows everything about its users, so it can predict any reaction with high accuracy. So an avatar can replace any person in boring things like work. He himself is now skipping half of his meetings and sends an avatar instead.

Software for an avatar runs in the Facebook cloud, and it takes approximately 8 Xeon-D platforms to power an active avatar.

Then the avatar in Mark's t-shirt stands up and says: "You see, the avatar is so good that he could start the presentation for me today."

30th April 2018

10:42am: Missing real estate bubble start
One of the reasons I decided to move to Munich 9 years ago, when I had a choice of Paris, London and Munich, was that there was no housing bubble. I missed buying a flat here at decent prices for many reasons I don't want to go into here. Now affordability is approaching the records set by Sydney and Hong Kong, so I decided not to bother.


Today I found a very interesting offer: a flat located very close to the center of Munich (in the middle of my favorite neighborhood, close to all amenities), everything perfect, at a ~50% (!!!) discount to the market price. The drawback: the house is big, 90% of the inhabitants receive social benefits, and many are immigrants (but as an immigrant myself I don't mind at all).

I thought the discount in this case should have been 10-20%, so I am probably missing another important factor. What else could it be?

21st April 2018

8:58pm: What is that?

I often see these things in parks, usually in the least accessible places, but close to the ground. No idea what they are.

What could it be?

16th April 2018

2:04pm: hard things in computer science
Now at work I learn the hard way that the saying "There are only two hard problems in Computer Science: cache invalidation and naming things" is so true.

And I mean cache invalidation, or rather cache replacement. I am running a microbenchmark similar to the STREAM Scale kernel on a Skylake server; performance is 14.3-17.1GB/sec, and this ~20% variation in performance is driven by very obscure factors..
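For readers who have not seen it, a Scale-style kernel and its bandwidth accounting look roughly like the sketch below. This is not my actual benchmark; the function names are made up for illustration, and the byte count assumes one load and one store per element and ignores write-allocate traffic.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM Scale-style kernel: dst[i] = s * src[i]. */
void stream_scale(double *dst, const double *src, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = s * src[i];
}

/* Bytes moved by one pass: one 8-byte load plus one 8-byte store per
 * element (assumption: write-allocate traffic is not counted). */
double scale_bytes(size_t n)
{
    return 2.0 * (double)sizeof(double) * (double)n;
}

/* Convert bytes moved and elapsed seconds into GB/sec (1 GB = 1e9 bytes). */
double gb_per_sec(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

Timing many passes of stream_scale over arrays much larger than L3 and feeding the totals into gb_per_sec yields numbers comparable to the GB/sec range quoted above.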

4th April 2018

9:36am: Great product
If anyone is thinking about upgrading a PC here is a nice option:
latest NUC. It costs like a mid-range gaming PC, performs almost like a top-end gaming PC, can handle any VR, is tiny and very quiet. See the review link for details.

A very balanced platform, great work from my colleagues who work on consumer platforms.

31st March 2018

10:10pm: Summer holiday
Just booked a week in a hotel in Montenegro. As usual I read tripadvisor reviews, and I only found one with 1 star, probably from a British guest:

"Firstly I have to say that the rooms are adequate and very clean. .... hotel is good ... BUT!: why the terrible review score? One evening I saw a man wearing a T-shirt which had written on it (back and front) the words in English which were INCREDIBLY DISGUSTING.Both my wife and I , as well as other English speaking guests, found this situation EXTREMELY offensive."
(might be a guest from Russia?). Then the guy explains how he chased everyone at the hotel to make another guest change his T-shirt.

"To add insult to injury we were then, and only then, offered a free meal at the beach restaurant which we refused. The manager had directed staff to refuse our payment so we gave the waiter a 10 Euros tip for a ham bruschetta and two cold drinks which he immediately pocketed.
This obnoxious incident left us totally staggered by the response of the staff so if you wish to be treated in much the same way then be my guest."

Response from management: "... I can assure you that after the second time that you addressed the management, immediate steps were taken, and the guest in question was asked to change in something different. It was at that moment, on his way out of the restaurant, after we asked the guest to change, that you encountered him for the 3rd time. ..."

I wonder what the words on the T-shirt were..

And what should I wear when I come there this summer :)

18th March 2018

8:51am: It was just a dream
I hailed a taxi to get to the Munich train station from home. After we arrived, the meter was only showing 2.32. I said in German: "Why 2.32, it should be around 12 euros?". The driver looked back at me, and I noticed it was Putin. Then Putin the taxi driver said in Russian: "Young man, who needs your euros, it is 2 roubles 32 kopecks".

It was just a dream, right before the elections..

16th March 2018

7:52pm: Jacob speaks
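3y9m, mixing Russian, German and English in one sentence: "Если вы не aufräumen, то everybody should clean it up!" ("If you don't tidy up, then everybody should clean it up!")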

15th March 2018

2:41pm: sq full
if a function gets too many l3 hit and sq full events, maybe it is not bound by l3 throughput, maybe some [...] had disabled L2 prefetchers in BIOS...

25th February 2018

7:11pm: Customers
I have a good customer, I really like the guy and especially the code we optimize together. But sometimes he complains about the compiler:

1. The compiler is stupid! The loop is so simple, and yet compiler vectorized it much worse than my intrinsics did.
2. The compiler is too smart! I made an intrinsics version of the loop, and it is only 5% faster than compiler generated.


24th February 2018

10:23am: Lufthansa Munich robot
I just interacted with this lady:

I asked her 4 questions about my flight, gate and terminal. 3 questions she did not understand, and one she misunderstood. The main problem was acoustics: instead of placing a decent mic array like in an Alexa or Google device, they put in something that looks like an Arduino microphone! No, really! The robot does not work on its own: there are two ladies helping. They told me that I should speak into the mic, but when I am too far away it does not hear me, and when I am too close, there are acoustic distortions because of that cheap mic.

No way these devices will conquer us! (or take our jobs :))

22nd February 2018

9:42am: Studying to be a programmer
I studied physics in university, and I only had 3 programming related courses: numeric methods in Fortran, Operating systems, and CS 101 (finite automata, Turing machines, etc).

During the last ~19 years of working as a programmer full time I have learned some tricks of our trade. But now I am thinking about what it would take me today, if I were 18 y.o. and wanted to get just enough CS theory to understand how things work without formally studying CS.

I think just four books are enough: Aho & Ullman (compilers), Cormen (algorithms), Hennessy & Patterson (hardware), and Tanenbaum (OS and networks). Am I missing something?

Upd: based on comments: Mao(crypto), Brooks(projects)

16th February 2018

10:44am: Unluck: 1/200000
Today a chain of unlikely events made me come to work an hour late.
I was late and missed my train: happens twice a year, ~1/100
the next suburban train was cancelled: happens every 2 months, ~1/50
I switched to u-bahn+bus, and missed the u-bahn by ~1 second, ~1/5 (they come often)
I missed my bus by ~1 second (~1/4*1/2=1/8)
total: 1/(100*50*5*8) = 1/200000

Should I start buying lottery tickets or even buy a bitcoin?

15th February 2018

2:45pm: ЗападлО
This time in Helsinki as always I read many signs translated to English, Swedish and Russian. I liked this one:

Have you noticed that the fine print in English and Finnish mentions a fine of 350 euros, but the Russian does not? Also, I would not translate "NO SMOKING" as "место для курения" ("smoking area")...

Another thing: at Helsinki airport, I always went to the station and bought a ticket to Helsinki on a local train. When I tried to do it this time, the machine told me it no longer sells tickets to Helsinki, only to other cities. (That was a software upgrade..). The only way is to go back to the airport's ticket machine (5 minutes one way), and miss a train.

P.S. Still like Espoo, new swimming pool was very nice, and very crowded at 6AM. Parking lot was full, it is good that I walked there from my hotel.

13th February 2018

12:00pm: Whose BMW is that?
Right at the bicycle parking lot near our office someone has parked a brand new BMW cabriolet, and left an empty pack of cigarettes and a bottle of liqueur in it. It is getting covered with snow, such a beautiful car.. Alas, there are no license plates, so we can't find the owner..
Anyone to claim?

8th February 2018

9:21pm: Debugging...
I will probably find why that happens tomorrow, but here it is:
(gdb) n
631 __m256i C = _mm256_mul_epu32(A, B);
(gdb) p/x A.m256i_i32
$1 = {0x1,  0x1, 0x1, 0x1,  0x1, 0x1, 0x1, 0x1}
(gdb) p/x B.m256i_i32
$2 = {0xff, 0x0, 0x0, 0xff, 0x0, 0x0, 0x0, 0xff}
(gdb) n
(gdb) p/x C.m256i_i32
$3 = {0xff, 0x0, 0x0, 0x0,  0x0, 0x0, 0x0,  0x0}


While posting, I RTFM'd, and found what is wrong. Very counter-intuitive, but above is expected!
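What the manual says: _mm256_mul_epu32 multiplies only the low unsigned 32 bits of each 64-bit element and writes the full 64-bit product, so the odd 32-bit lanes of the inputs are ignored. Below is a scalar model of that behaviour (the function name is made up for illustration; it does not use the intrinsic itself):

```c
#include <stdint.h>

/* Scalar model of _mm256_mul_epu32: for each of the four 64-bit elements,
 * multiply the low unsigned 32 bits of a and b and store the 64-bit
 * product. Inputs and output are viewed as eight 32-bit lanes in
 * little-endian order, matching the gdb m256i_i32 view above. */
void mul_epu32_model(const uint32_t a[8], const uint32_t b[8], uint32_t out[8])
{
    for (int q = 0; q < 4; q++) {
        uint64_t prod = (uint64_t)a[2 * q] * (uint64_t)b[2 * q];
        out[2 * q]     = (uint32_t)prod;          /* low half of the product */
        out[2 * q + 1] = (uint32_t)(prod >> 32);  /* high half of the product */
    }
}
```

With the A and B from the gdb session, only lanes 0, 2, 4 and 6 of B take part (0xff, 0, 0, 0), which gives exactly the C shown in $3. When a lane-by-lane 32-bit product is wanted, _mm256_mullo_epi32 is the usual choice.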

29th January 2018

10:07am: Kickstarter tip
I just got a "superbacker" badge there (nothing to be proud of, I spent too much $ on mostly useless crap that seemed cool at the moment).

Here is a tip: never use the "remind me" button!

Rationale: 90% of the projects are late. 80% of those are late by 0.5-2 years. 100% of the projects that start shipping 1 year late keep shipping for another year, in order of backer number... So if you hit the "remind me" button and then backed after the reminder, you are guaranteed to pay at the very same time as the folks who backed the project when you hit that bloody button, but you'll get your reward up to 1 year later (if ever).

P.S. Waiting for my Spectrum next now, most likely it will be late by just 2-3 months.

27th January 2018

2:32pm: Jacob draws
Jacob is 3 years 7 months old. Today's drawing:

Today's saying: "Ded Moroz (Father Frost) is just a fairy-tale New Year guy."

20th January 2018

1:34pm: School rules
Looking around for a primary school. Here is what I found behind the first link on one school's homepage:
"We come to the class on time. In the school building we behave quietly. We do not run in the hallway and on the stairs. We go on the right side and pay attention to others. We keep our toilets clean and follow our toilet rules. We take a quick break and stay in the schoolyard. We observe the quiet zone. We throw our garbage in the waste container. We handle the plants carefully and do not run or play in the school garden. We do not throw snowballs.
Consequences of offence: verbal / written apology, ..."

Just wondering how they make 6-7 year olds observe all these rules?

16th January 2018

10:39am: Compiler is too smart
A hotspot inside a nested loop:
__m128i result = _mm_add_epi32(a, b);
return ((uint16_t*)&result)[1];

This compiles to:

Blocked store forwarding, CPI 3+, too slow.
What if I try to get rid of the long store forward block penalty using an obvious workaround:
return ((uint32_t*)&result)[0]>>16;

Then compiler will still generate:

Good compiler, smart! It won't let me do

no matter what; it does not even like the above written as inline assembly.
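For completeness, here is a sketch of two ways to ask for the second 16-bit element without a pointer cast, using SSE2 intrinsics only. The function names are made up, and I make no claim about what a particular compiler emits for them; it is free to pick its own instruction sequence, as the story above shows.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Variant 1: extract 16-bit element 1 directly from the register. */
uint16_t second_u16_extract(__m128i a, __m128i b)
{
    __m128i result = _mm_add_epi32(a, b);
    return (uint16_t)_mm_extract_epi16(result, 1);
}

/* Variant 2: move the low 32 bits to a general purpose register and
 * shift, which mirrors the ((uint32_t*)&result)[0] >> 16 workaround. */
uint16_t second_u16_shift(__m128i a, __m128i b)
{
    __m128i result = _mm_add_epi32(a, b);
    return (uint16_t)((uint32_t)_mm_cvtsi128_si32(result) >> 16);
}
```

Both variants return the same value as the original pointer-cast version on x86, so they can be swapped in and the generated code compared.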

11th January 2018

9:40am: A very interesting s/w performance optimization case.
Here is a simple pseudocode:
function f(array_A)
  SoA = transform1(A);
  for i=0 to 4
    array_B=load_array_from_SoA(SoA, i);
  return transform4(array_D);

SoA is a structure of arrays.
transform functions take an array, make a lot of calculations, return transformed array.
Baseline performance: let's say 200 seconds.
