@mdboom (Contributor) commented Sep 26, 2023

This implements a set of the Tier 2 pystats mentioned in #109329.

Specifically:

  • Total micro-ops executed
  • Total number of traces started
  • Total number of traces created
  • Optimization attempts
  • Per-uop execution counts, like we have for Tier 1 instructions
  • A histogram of uops executed per trace
  • Trace too long
  • Unsupported opcode (maybe even with counters for each offender)
  • Inner loop found
  • Too many frame pushes (currently capped at a depth of 5)
  • Too many frame pops (if we return from the original frame)
  • Recursive function call

It also fixes a "bug" where specialization "hits" in the running Tier 2 interpreter would count against the Tier 1 stats.

Not implemented (since it will require reworking the `DEOPT_IF` calls):

  • Exit reason counts: polymorphism vs branch mis-prediction

Example output (for the nbody benchmark):

Optimization (Tier 2) stats

statistics about the Tier 2 optimizer

Overall stats

| | Count | Ratio |
|---|---:|---:|
| Optimization attempts | 5 | |
| Traces created | 5 | 100.0% |
| Traces executed | 1,600,070 | |
| Uops executed | 65,202,621 | 40 |
| Trace stack overflow | 0 | |
| Trace stack underflow | 0 | |
| Trace too long | 5 | |
| Inner loop found | 0 | |
| Recursive call | 0 | |

Trace length histogram

| Range | Count | Ratio |
|---|---:|---:|
| <= 4 | 0 | 0.0% |
| <= 8 | 0 | 0.0% |
| <= 16 | 0 | 0.0% |
| <= 32 | 0 | 0.0% |
| <= 64 | 5 | 100.0% |

Optimized trace length histogram

| Range | Count | Ratio |
|---|---:|---:|
| <= 4 | 0 | 0.0% |
| <= 8 | 0 | 0.0% |
| <= 16 | 0 | 0.0% |
| <= 32 | 0 | 0.0% |
| <= 64 | 5 | 100.0% |

Trace run length histogram

| Range | Count | Ratio |
|---|---:|---:|
| <= 4 | 0 | 0.0% |
| <= 8 | 0 | 0.0% |
| <= 16 | 0 | 0.0% |
| <= 32 | 0 | 0.0% |
| <= 64 | 1,600,070 | 100.0% |

Uop stats

| Uop | Count | Self | Cumulative |
|---|---:|---:|---:|
| STORE_FAST | 14,900,598 | 22.9% | 22.9% |
| _SET_IP | 14,100,583 | 21.6% | 44.5% |
| LOAD_FAST | 10,100,513 | 15.5% | 60.0% |
| _GUARD_BOTH_FLOAT | 3,700,274 | 5.7% | 65.6% |
| _BINARY_OP_SUBTRACT_FLOAT | 2,900,138 | 4.4% | 70.1% |
| UNPACK_SEQUENCE_TUPLE | 2,400,092 | 3.7% | 73.8% |
| UNPACK_SEQUENCE_LIST | 2,400,092 | 3.7% | 77.5% |
| _POP_JUMP_IF_TRUE | 1,700,049 | 2.6% | 80.1% |
| _EXIT_TRACE | 1,600,070 | 2.5% | 82.5% |
| _ITER_CHECK_LIST | 1,600,065 | 2.5% | 85.0% |
| _IS_ITER_EXHAUSTED_LIST | 1,600,065 | 2.5% | 87.4% |
| COPY | 1,599,948 | 2.5% | 89.9% |
| _ITER_NEXT_LIST | 1,400,053 | 2.1% | 92.0% |
| UNPACK_SEQUENCE_TWO_TUPLE | 1,000,039 | 1.5% | 93.6% |
| LOAD_CONST | 800,001 | 1.2% | 94.8% |
| SWAP | 799,974 | 1.2% | 96.0% |
| BINARY_SUBSCR_LIST_INT | 799,974 | 1.2% | 97.2% |
| _BINARY_OP_MULTIPLY_FLOAT | 400,095 | 0.6% | 97.9% |
| _BINARY_OP_ADD_FLOAT | 400,041 | 0.6% | 98.5% |
| STORE_SUBSCR_LIST_INT | 399,987 | 0.6% | 99.1% |
| POP_TOP | 200,017 | 0.3% | 99.4% |
| _ITER_CHECK_RANGE | 99,984 | 0.2% | 99.5% |
| _IS_ITER_EXHAUSTED_RANGE | 99,984 | 0.2% | 99.7% |
| _ITER_NEXT_RANGE | 99,979 | 0.2% | 99.8% |
| GET_ITER | 99,979 | 0.2% | 100.0% |
| BINARY_OP | 27 | 0.0% | 100.0% |

Unsupported opcodes

| Opcode | Count |
|---|---:|

@brandtbucher (Member) left a comment:

This is so great to have, thanks! We can tweak the collection and output further if need be, but this is already going to be so much more insightful!

Just a couple of suggestions, but otherwise I'll plan on merging tomorrow.

```c
#define OPT_STAT_INC(name) do { if (_Py_stats) _Py_stats->optimization_stats.name++; } while (0)
#define UOP_EXE_INC(opname) do { if (_Py_stats) _Py_stats->optimization_stats.opcode[opname].execution_count++; } while (0)
#define OPT_UNSUPPORTED_OPCODE(opname) do { if (_Py_stats) _Py_stats->optimization_stats.unsupported_opcode[opname]++; } while (0)
#define OPT_HIST(length, name) do { if (_Py_stats) _Py_stats->optimization_stats.name[_Py_bit_length((length - 1) >> 2)]++; } while (0)
```
@brandtbucher (Member):

Why `((length - 1) >> 2)`?

@mdboom (Contributor, Author):

If I understand correctly, there is a minimum trace size of 2, so the first two power-of-two buckets (for 0 and 1) would always be empty; we might as well not support them (that's the `>> 2`). I also think it's easier to reason about buckets whose inclusive maximum is a round number (that's the `- 1`).
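A minimal sketch of that bucket math (using Python's `int.bit_length()`, which mirrors `_Py_bit_length`; the sample lengths are illustrative):

```python
def bucket(length: int) -> int:
    # >> 2 drops the two always-empty power-of-two buckets; - 1 makes
    # each bucket's inclusive maximum a round number.
    return ((length - 1) >> 2).bit_length()

for length in (2, 4, 5, 16, 17, 64):
    print(f"length {length:>2} -> bucket {bucket(length)} (<= {4 << bucket(length)})")
# length  2 -> bucket 0 (<= 4)
# length  4 -> bucket 0 (<= 4)
# length  5 -> bucket 1 (<= 8)
# length 16 -> bucket 2 (<= 16)
# length 17 -> bucket 3 (<= 32)
# length 64 -> bucket 4 (<= 64)
```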

```c
    uint64_t execution_count;
} UOpStats;

#define _Py_UOP_HIST_SIZE 5
```
@brandtbucher (Member):

We'll probably want more buckets than this (since execution lengths for closed loops will get pretty big), but it's probably fine for now. We can just bump it whenever.

@mdboom (Contributor, Author):

I thought the max trace length was 64 (`_Py_UOP_MAX_TRACE_LENGTH`). Can they be longer? (It's quite possible I don't understand the code.) In any event, I should probably add a comment here that this should be log2 of whatever the expected maximum length is.

@brandtbucher (Member) commented Sep 28, 2023:

> Can they be longer? (It's quite possible I don't understand the code.)

Not statically, but dynamically they can be quite long (imagine executing a trace for a closed loop hundreds of times).

So we should probably modify the code that adds to the histogram to be `Py_MIN(_Py_UOP_HIST_SIZE - 1, _Py_bit_length((length - 1) >> _Py_UOP_HIST_BIAS))`, so that anything overflowing the histogram ends up in the largest bucket instead of writing past the end of the array.

@mdboom (Contributor, Author):

Yeah, that's a good idea to add the range-checking.

Also, this made me realize I was counting the wrong thing: I was recording the "instruction pointer at exit" as the execution length. But (after discussing in person) it seems what we want is the number of uops executed per execution of the trace. I've updated this to report that instead.
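A Python sketch of the clamped recording, under the `_Py_UOP_HIST_SIZE`/`_Py_UOP_HIST_BIAS` names suggested above (note the clamp is a minimum, so overflow folds into the last bucket):

```python
_PY_UOP_HIST_SIZE = 5  # number of buckets
_PY_UOP_HIST_BIAS = 2  # skip the two always-empty small buckets

hist = [0] * _PY_UOP_HIST_SIZE

def record(uops_executed: int) -> None:
    # Assumes at least one uop was executed.
    bucket = ((uops_executed - 1) >> _PY_UOP_HIST_BIAS).bit_length()
    # Clamp so a very long run lands in the last bucket instead of
    # indexing past the end of the array.
    hist[min(bucket, _PY_UOP_HIST_SIZE - 1)] += 1

record(1_000_000)  # a closed loop executing many times
print(hist)  # [0, 0, 0, 0, 1]
```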

```c
print_histogram(FILE *out, const char *name, uint64_t hist[_Py_UOP_HIST_SIZE])
{
    for (int i = 0; i < _Py_UOP_HIST_SIZE; i++) {
        fprintf(out, "%s[%d]: %" PRIu64 "\n", name, (1 << (i + 2)), hist[i]);
```
@brandtbucher (Member):

Why `i + 2`? Is it to correct for the `>> 2` when recording the stats?

Oh, I think I see. We're starting with the "third" bucket, since tiny traces are basically nonexistent. Maybe add another constant next to `_Py_UOP_HIST_SIZE` (like `#define _Py_UOP_HIST_BIAS 2` or something) so it's clearer why we're doing this and where.

@mdboom (Contributor, Author):

Yes, that's right. Adding the `#define` for the bias makes sense.
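For illustration, the label math once the bias has a name (a sketch; the bucket's inclusive upper bound is `1 << (i + bias)`, which replaces the bare `i + 2`):

```python
_PY_UOP_HIST_SIZE = 5
_PY_UOP_HIST_BIAS = 2

for i in range(_PY_UOP_HIST_SIZE):
    print(f"<= {1 << (i + _PY_UOP_HIST_BIAS)}")
# Prints <= 4, <= 8, <= 16, <= 32, <= 64: the ranges shown in the histograms above.
```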

```diff
 if not key.startswith(prefix):
     continue
-name, _, rest = key[7:].partition("]")
+name, _, rest = key[len(prefix) + 2:].partition("]")
```
@brandtbucher (Member):

Is this correct? The old value was 7, but `len("opcode") + 2` is 8.

@mdboom (Contributor, Author):

Good catch. It seems not to matter, but I might as well revert it just in case.
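A quick check of the off-by-one, assuming keys shaped like `opcode[NAME]...` (the key below is hypothetical):

```python
prefix = "opcode"
key = "opcode[LOAD_FAST].execution_count"

# len(prefix) + 1 == 7 skips exactly "opcode[" ...
name, _, rest = key[len(prefix) + 1:].partition("]")
print(name)  # LOAD_FAST

# ... whereas len(prefix) + 2 == 8 also eats the name's first character.
name, _, rest = key[len(prefix) + 2:].partition("]")
print(name)  # OAD_FAST
```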

```python
def _emit_comparative_execution_counts(base_rows, head_rows):
    base_data = dict((x[0], x[1:]) for x in base_rows)
    head_data = dict((x[0], x[1:]) for x in head_rows)
    opcodes = set(base_data.keys()) | set(head_data.keys())
```
@brandtbucher (Member):

Dictionary keys support set operations:

Suggested change:
```diff
-opcodes = set(base_data.keys()) | set(head_data.keys())
+opcodes = base_data.keys() | head_data.keys()
```

@mdboom (Contributor, Author):

TIL!
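For reference, a small demonstration of dict view set algebra (toy data):

```python
base_data = {"LOAD_FAST": (1,), "STORE_FAST": (2,)}
head_data = {"STORE_FAST": (3,), "BINARY_OP": (4,)}

# dict.keys() returns a view that supports |, &, -, ^ directly.
opcodes = base_data.keys() | head_data.keys()
print(sorted(opcodes))  # ['BINARY_OP', 'LOAD_FAST', 'STORE_FAST']
```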


```python
    rows = []
    default = [0, "0.0%", "0.0%", 0]
    for opcode in opcodes:
```
@brandtbucher (Member):

Maybe give these a deterministic (and somewhat meaningful) order?

Suggested change:
```diff
-for opcode in opcodes:
+for opcode in sorted(opcodes):
```

@mdboom (Contributor, Author):

We could, but we end up sorting by the counts a few lines below anyway...

```python
            (opcode, base_entry[0], head_entry[0],
             f"{100*change:0.1f}%"))

    rows.sort(key=lambda x: -abs(percentage_to_float(x[-1])))
```
@brandtbucher (Member):

Suggested change:
```diff
-rows.sort(key=lambda x: -abs(percentage_to_float(x[-1])))
+rows.sort(key=lambda x: abs(percentage_to_float(x[-1])), reverse=True)
```
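Both spellings order rows by descending absolute change; a sketch with toy rows and an assumed `percentage_to_float` helper matching the one the script already calls:

```python
def percentage_to_float(s: str) -> float:
    return float(s.rstrip("%"))

rows = [("LOAD_FAST", "5.0%"), ("STORE_FAST", "-20.0%"), ("BINARY_OP", "10.0%")]

# reverse=True avoids negating inside the key function.
rows.sort(key=lambda x: abs(percentage_to_float(x[-1])), reverse=True)
print([r[0] for r in rows])  # ['STORE_FAST', 'BINARY_OP', 'LOAD_FAST']
```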

@brandtbucher (Member):

(Also, it looks like CI caught some trailing whitespace.)

@brandtbucher (Member) left a comment:

Looks great, thanks!

@brandtbucher merged commit e561e98 into python:main on Oct 4, 2023.