I wrote a calculator that can do decimal and hexadecimal arithmetic conveniently.
https://calc.stuffedcow.net/ (New window)
As a computer engineer, I often do calculations with a mix of decimal and hexadecimal numbers. But I haven’t been able to find a calculator that can work in both bases conveniently. Many calculators don’t support hexadecimal at all (e.g., HP 30b). Some calculators do support it, but it’s buried in layers of menus with no easy way to enter A through F, so it’s much slower than computing in decimal (e.g., HP 42S).
Some software calculators in Programmer mode (e.g., Windows 10, Gnome calculator, KCalc) are better, but have some weird constraints. For example, both Windows calculator and KCalc only allow integers in Programmer mode, and Gnome calculator does not allow switching bases within the same calculation. And, oddly, all three calculators’ programmer modes are missing the typical scientific calculator functions. Programmers don’t science?
So I wrote yet another calculator.
The primary goal is to be able to switch between hexadecimal and decimal quickly while using keyboard (numeric keypad) input. There’s also a variant for touch devices that has more buttons on the screen and relies less on Shift and Ctrl modifier keys.
The calculator has both RPN and infix modes. I prefer RPN, but I suspect most people are more familiar with infix. Not surprisingly, writing the infix mode was many times harder than the RPN mode. RPN mode simply performs each operation on the top two numbers on the stack immediately when an operation button is pressed. Infix notation needs to deal with operator precedence, operator associativity (right-associative vs. left-associative), unary vs. binary operators, and incrementally building and evaluating a syntax tree, along with a way to edit this tree to implement a backspace key. Some calculators even let you edit in the middle of an expression, but mine only allows deleting from the end.
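For the curious, the RPN side really is that simple. Here’s a minimal sketch of the idea (illustrative only; the real calculator operates on arbitrary-precision MPFR values rather than double):

#include <stack>
#include <stdexcept>

// Minimal sketch of RPN evaluation. Numbers are pushed onto a stack; an
// operator key pops the top two values and pushes the result.
class RpnCalc {
    std::stack<double> stk;
public:
    void enter(double x) { stk.push(x); }            // digit entry + Enter
    void binary_op(double (*op)(double, double)) {   // +, -, *, /, ...
        if (stk.size() < 2) throw std::runtime_error("stack underflow");
        double y = stk.top(); stk.pop();             // second operand
        double x = stk.top(); stk.pop();             // first operand
        stk.push(op(x, y));                          // result goes back onto the stack
    }
    double top() const { return stk.top(); }
};

// Usage: "7 Enter 3 -" computes 7 - 3 = 4
//   RpnCalc c;
//   c.enter(7); c.enter(3);
//   c.binary_op([](double x, double y) { return x - y; });

Infix gets none of this simplicity: the same backspace key that trivially pops a stack entry in RPN has to unwind a partially built syntax tree instead.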
For the GPU architects out there who stare at IEEE 754 numbers all day, there are buttons to convert the internal floating-point value to and from 32-bit or 64-bit IEEE 754 numbers.
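For the 64-bit case, the conversion itself is just a reinterpretation of bits. A rough sketch (using double as a stand-in; the calculator’s internal value is an MPFR number, so the real code also has to round to/from binary64 first):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a double as its 64-bit IEEE 754 bit pattern and back.
uint64_t to_bits(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);   // bit-for-bit copy, no numeric conversion
    return u;
}

double from_bits(uint64_t u) {
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

int main() {
    std::printf("%016llx\n", (unsigned long long)to_bits(1.0));   // 3ff0000000000000
    std::printf("%.15g\n", from_bits(0x400921fb54442d18ull));     // 3.14159265358979...
}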
The user interface is HTML, CSS, and JavaScript. The calculator itself is written in C++ and compiled to WebAssembly, with the GNU MPFR (multiple-precision floating point with correct rounding) library actually doing most of the arithmetic operations. All calculations are done client-side in the web browser.
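If you haven’t seen MPFR before, using it looks roughly like this (a toy example, not the calculator’s actual code):

#include <mpfr.h>
#include <cstdio>

// Toy MPFR example: compute 1/3 + 1/7 at 128-bit precision with
// round-to-nearest (correct) rounding.
int main() {
    mpfr_t a, b, sum;
    mpfr_inits2(128, a, b, sum, (mpfr_ptr) 0);   // 128 bits of precision each

    mpfr_set_ui(a, 1, MPFR_RNDN);
    mpfr_div_ui(a, a, 3, MPFR_RNDN);             // a = 1/3
    mpfr_set_ui(b, 1, MPFR_RNDN);
    mpfr_div_ui(b, b, 7, MPFR_RNDN);             // b = 1/7
    mpfr_add(sum, a, b, MPFR_RNDN);              // sum = 10/21, correctly rounded

    mpfr_printf("%.40Rf\n", sum);                // 0.47619047619047619...
    mpfr_clears(a, b, sum, (mpfr_ptr) 0);
    return 0;
}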
I don’t suppose there’s any chance of a locally compilable version? I’m probably old-fashioned, but I’m not sure I like the embrowserification of all computing.
On the other hand, I perfectly understand if you don’t want to maintain a version of the UI for the web and for every platform that every well-meaning passerby with a feature request uses.
BTW, I read your thesis with great interest. Are you familiar with “Dynamic Register Renaming Through Virtual-Physical Registers” (Monreal-Arnal, et al. 2000)? The scheme discussed in that paper seems relevant to the microarchitecture described in your work: if we delay allocating physical registers to results until those results are actually produced, then we can have a register file bank with a single write port per execution unit without rename being forced to assign an execution unit early.
In addition to the two methods they present for avoiding deadlock under this scheme, I can think of a third, which could be implemented alongside one of the other two: we can have a backing register file with fewer ports (and thus less area per register) and/or write out register values to some canonical address in RAM. If there are not enough physical registers in the main register file to allocate one for a new result, a value can be evicted from the main file into the backing file (or to RAM) to free up a register in the main file. If RAM is used, we can always avoid deadlock. If RAM is not used, then the combined size of the main and backing files must be at least the number of virtual-physical registers (but we can still avoid assigning a physical register until execution finishes); otherwise deadlock is not avoided, and one of the deadlock-avoidance methods outlined in the Monreal-Arnal paper must be used as well. The main file can be sized according to performance requirements, and then the backing file can be sized according to deadlock requirements, saving area relative to an implementation with all the registers in the main file.
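To make the eviction idea concrete, here is a rough behavioral sketch (all names and structure are mine, purely for illustration; nothing here is from the paper):

#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Behavioral sketch of the main-file / backing-file split.
struct MainRegFile {                                  // few entries, heavily ported in hardware
    std::vector<std::optional<uint64_t>> regs;
    explicit MainRegFile(std::size_t n) : regs(n) {}
    std::optional<std::size_t> find_free() const {
        for (std::size_t i = 0; i < regs.size(); ++i)
            if (!regs[i]) return i;
        return std::nullopt;
    }
};

struct BackingStore {                                 // fewer ports, or just a canonical RAM region
    std::unordered_map<uint32_t, uint64_t> values;    // keyed by virtual-physical register
};

// Allocate a main-file slot for a completing result. If the main file is full,
// evict a victim's value to the backing store first; as long as the backing
// store can always accept it (e.g. when it is RAM), allocation never deadlocks.
std::size_t allocate(MainRegFile& mf, BackingStore& bs,
                     uint32_t victim_vreg, std::size_t victim_slot) {
    if (auto slot = mf.find_free())
        return *slot;
    bs.values[victim_vreg] = *mf.regs[victim_slot];   // spill the victim
    mf.regs[victim_slot].reset();                     // its slot is now free
    return victim_slot;
}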
Of course, if RAM is used and we are implementing an existing architecture (such as x86 in your thesis), then the RAM region used to back the registers must be visible only to the microarchitecture, and not at an architectural level, or else there is a potential to be incompatible with code written for existing implementations. A clean-slate architecture can reserve an architecturally visible set of addresses for this purpose without backward-compatibility constraints (such as, perhaps, the region beyond the stack pointer).
It is not relevant to an architecture as register-poor as x86-32, but an architecture such as Knuth’s MMIX with a very large number of logical registers could also benefit from this setup, as the number of full-ported physical registers (or total physical registers, if RAM is used as the backing store) can actually be fewer than the number of logical registers, as long as there are enough of them to meet the target IPC of the microarchitecture.
Hello!
I, too, do not like the embrowserification of everything. But it’s also very hard to argue with being able to run cross-platform with little effort for both the developer and the user. I personally use Linux at home, Windows at work, and a smartphone, so I’d need to create three versions just for myself. Which graphics API would I use? Other than something like Electron (basically bundle a browser with the application), rewriting the UI sounds like a lot of effort. (Electron probably counts as more embrowserification though.)
I still try to resist JavaScript though: I try to write my web apps in C++ (compiled to WebAssembly), though I’ve found a lot of the UI code still ends up being JavaScript.
I’m somewhat aware of the general idea of delaying register file allocation until later in the pipeline to save on register file entries, but can’t remember whether I’ve come across specifically the paper you mentioned.
Reducing register file size is good, but I’m more horrified by the extra logic needed to do a second level of renaming. I think the “backing register file” requires yet another type of renaming beyond what’s proposed in the paper. You’d need to track the location of all those register values as they migrate to and from the backing storage.
The remapping table looks a lot like the register file itself (number of entries and number of read/write ports) except with fewer bits per entry. But this is still a large number of new read/write ports. There’s also a lot of complexity involved in broadcasting values (the new virtual -> physical mapping when a register is allocated). Every execution unit can generate a new mapping every cycle, and that mapping needs to go to every instruction queue entry on every execution unit, and that’s very challenging to do in a cycle (does it ~double the cost of instruction wakeup?).
There’s also an assumption that the GMT (general map table) gets to see the new register allocations as they happen, but the execution units are usually so far away from the register renaming logic that it takes more than one cycle to even send a signal from scheduling/execution back to the remapping table. So now you need a mechanism to deal with instructions that had already passed the GMT before it saw the new remapping, but hadn’t yet reached the instruction queues when the broadcast happened there, so they missed the broadcast.
In comparison, oversizing the register file feels like a smaller price to pay. Moore’s Law has made wasting a bit of area more tolerable, and the amount of a memory array you can access grows fairly quickly with increased access time, so the penalty you pay for oversizing the register file is often no more than an extra pipeline stage and a little power and area, and this cost is isolated to a small part of the processor (i.e., less complexity than several blobs of logic added to several places around the core that all need to talk to each other).
I’m at Intel working on big out-of-order processors, so things like “broadcast” and “table that every instruction needs to query/update” scare me a lot more than they used to.
Admittedly, I’m an interested layman, not someone with a career or formal education in the field, so while I have a sense of what things are expensive in area/power/delay, I don’t have a great grasp on actual “how much more expensive” numbers.
Reducing register file size is good, but I’m more horrified by the extra logic needed to do a second level of renaming. I think the “backing register file” requires yet another type of renaming beyond what’s proposed in the paper. You’d need to track the location of all those register values as they migrate to and from the backing storage.
I believe the only extra thing the main renamer absolutely needs to track is whether the committed value of a logical register (or the value of a virtual-physical register, depending on implementation details) is present in the main register file (one bit; or, if the main register file doesn’t have a power-of-two number of entries, we can just use a spare register number to indicate not-present and don’t even need to add a bit). We’re sizing the main file for performance, and the logic for the backing store does not need to be as performant; it just needs to avoid deadlock when the main file is exhausted. It needs to accept and store values that the main renamer evicts, and it needs to supply those values when the main renamer requests them again, but these do not necessarily have to be done in the same cycle, as long as the main file is large enough that such stalls are reasonably rare.
Other than that, any information shared between the main renamer and the backing-store logic is simply a performance optimization (for example, they might coordinate on using unused read bandwidth from the main file to pre-flush main-file values to the backing store before they’re actually evicted, so that evicting pre-flushed registers only requires setting their present bit to indicate they aren’t in the main file). The backing-store logic can otherwise be a black box to the main renamer. Whether it does its own renaming internally, or just has enough registers to back every logical or virtual-physical register (according to implementation details) without renaming (but at a lower port count than the main file), or whether it just dumps everything to a canonical address in RAM without having any on-chip storage at all (other than the general data caches), is irrelevant to the main renamer.
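To illustrate the spare-register-number encoding (the sizes here are made up, just for illustration):

#include <cstdint>

// With, say, a 96-entry main file, physical register numbers need 7 bits anyway,
// so an unused 7-bit code can mean "not in the main file" without adding a bit.
constexpr uint8_t NUM_MAIN_REGS = 96;
constexpr uint8_t NOT_IN_MAIN   = 127;     // spare code acts as the "present" bit for free

struct MapEntry {
    uint8_t phys;                          // 0..95 = main-file slot, 127 = evicted to backing store
};

inline bool in_main_file(MapEntry e) { return e.phys != NOT_IN_MAIN; }

// On a miss, the main renamer asks the backing-store logic for the value,
// re-allocates a main-file slot, and rewrites phys with the new slot number.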
The remapping table looks a lot like the register file itself (number of entries and number of read/write ports) except with fewer bits per entry. But this is still a large number of new read/write ports. There’s also a lot of complexity involved in broadcasting values (the new virtual -> physical mapping when a register is allocated). Every execution unit can generate a new mapping every cycle, and that mapping needs to go to every instruction queue entry on every execution unit, and that’s very challenging to do in a cycle (does it ~double the cost of instruction wakeup?).
Since we’re generating the mapping at instruction completion, and completion of the instruction has to be broadcast anyways, can we amortize the cost by including the mapping in that broadcast, or is the cost proportional to the width of the broadcast network? (Even in the latter case, can we use the mapping to replace other data that would be in the broadcast traditionally?)
While we do need to broadcast the renames that do happen, the renamer doesn’t need to generate more renames per cycle than it would with an equivalent traditional core, even if there are more execution units than renamer width: if the completion of an instruction is delayed by inadequate renamer bandwidth, it would have been equivalently delayed in the front end on a traditional microarchitecture.
In comparison, oversizing the register file feels like a smaller price to pay. Moore’s Law has made wasting a bit of area more tolerable, and the amount of a memory array you can access grows fairly quickly with increased access time, so the penalty you pay for oversizing the register file is often no more than an extra pipeline stage and a little power and area, and this cost is isolated to a small part of the processor (i.e., less complexity than several blobs of logic added to several places around the core that all need to talk to each other).
A lot of my musings on this concept have assumed a clean-slate architecture with a large number of logical registers, and either small implementations of such an architecture with many fewer physical than logical registers (in which case some sort of backing store arrangement is needed in any case), or implementations with high-order SMT, where providing a large enough register file to back every committed register for every thread (let alone accommodate any speculative values) might be prohibitive. Consider an implementation of Knuth’s MMIX with 8-way SMT, for example.
I also recall reading an article recently that mentioned that recent nodes have had more difficulty shrinking SRAMs than logic. I assume this means that the relative size of addressing logic and actual storage elements for register files will change, which I presume will affect the balance of when things like CAMs and highly multiported register files are worthwhile?
Often, the cost of a mechanism isn’t really in the extra bits needed to store state, but in how difficult/expensive it is to correctly update that state, and whether updating the state can be done fast enough to fit into the timing budget (Does it fit in one 4-6 GHz clock cycle? If not, can it be pipelined over multiple cycles? How much extra hardware do you need to add if it’s pipelined?).
And on top of that: What happens when you get a pipeline flush (at the most inconvenient location, of course)? How do you correctly undo all of the state changes back to the point of the pipeline flush, and do it within a few cycles?
Tracking whether a virtual register is located in the “main” register file: I don’t think one bit is enough. That bit only tells me where my logical register is not, it doesn’t tell me where it is.
Broadcasting new mappings: The wakeup broadcast network is extremely expensive (every execution port talks to every scheduler entry every cycle), so there aren’t going to be unused broadcast bits to repurpose. Doing renaming at instruction execution time at minimum doubles the cost of the broadcast network (“Hi, I’m going to write back virtual register V, and V’s new mapping is physical register P”).
(If your processor happens to use a matrix-style instruction scheduler, the broadcast is even weirder: “Hi, I’m instruction scheduler entry I”. No register numbers are broadcast at all, so both V and P are new bits you need to broadcast, and you need to add several hundred comparators to the instruction scheduler to listen for V being broadcast.)
“the renamer doesn’t need to generate more renames per cycle than it would with an equivalent traditional core, even if there are more execution units than renamer width”
Unfortunately, looking only at average required throughput often isn’t sufficient. One challenging aspect of building less hardware than the worst-case behaviour requires is that you need to invent a mechanism to handle the worst case. “Just add a cycle” is often extremely difficult to do.
For example, suppose you executed 10 instructions in some cycle, but only built enough bandwidth for 9 (let’s ignore the fact that arbitration and muxes aren’t cheap). What happens to the instruction that can’t broadcast its mapping? If you stall it by a cycle, it may collide with another instruction coming down the same pipeline in the next cycle. Also, what happens to an instruction that expects your broadcast to happen on time? And what about the consumers of that instruction, and their consumers, and so on? By the time you realize an instruction needs a one-cycle stall, you already have several cycles of instructions scheduled behind it (20-50 instructions) that rely on your instruction completing on time. What do you do?
* Broadcast a stall signal to stall everyone for a cycle? (Can you do this within a cycle?)
* Squash all 20-50 instructions and redo them all? (Ouch!)
* Invent a mechanism to compute which of the 20-50 instructions truly depend (directly or indirectly) on the delayed instruction and only squash those? (How expensive is this circuit?)
And do the muxes/arbitration that allow you to build less than the worst case end up costing more than just building for the worst case?
Keep in mind that an alternative design to all of this complexity and the occasional stall is to simply build a traditional physical register file that’s smaller than you’d like. The performance you’d give up by doing this is fairly small (I’d guess single-digit percent or less).
A large number of logical registers isn’t always a good thing. There are diminishing returns to having more registers, but you have to pay for them in the size of the register renaming table, the number of bits in the instruction opcode to encode so many registers, and the time to save/restore so many registers on a context switch. It makes more sense on an architecture that doesn’t have a renaming table and has plenty of opcode bits (e.g., Itanium: no renaming, 41-bit opcodes, and 128 logical registers), but then I’d question whether the architecture itself makes much sense 😛
The case of many thread contexts is interesting, but chances are, if you build many threads, you intend on using all of them, so there won’t be that many unused physical registers for you to offload to slower/bigger storage. GPUs have this exact problem, and they build huge register files (megabytes) as a result (Not many logical registers per thread, but a huge number of threads).
One thing to keep in mind is that the reason for bringing the concept up here is that I thought it particularly relevant to your thesis work: namely, a relatively narrow OoO machine on an FPGA, where implementation details used to save space on the FPGA constrain the selection of execution units if rename is done in the front end. You’re talking about 6 GHz 10-wide machines, and I think it could be useful at that end of things, but the “constraints on front-end rename on an FPGA” scenario seems like it would be where the concept would have a home-field advantage.
(If your processor happens to use a matrix-style instruction scheduler, the broadcast is even weirder: “Hi, I’m instruction scheduler entry I”. No register numbers are broadcast at all, so both V and P are new bits you need to broadcast, and you need to add several hundred comparators to the instruction scheduler to listen for V being broadcast.)
V is just an arbitrary tag, so why can’t “I” be a component of that tag? It can’t be the whole tag, because you need to disambiguate between instructions that occupy the scheduler entry at different times, but it can probably be most of it, and now most of V is *not* new bits added to the broadcast.
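Concretely, something like this is what I have in mind (field widths are made up, purely illustrative):

#include <cstdint>

// A 64-entry scheduler already broadcasts the 6-bit entry index "I"; a couple
// of generation bits disambiguate successive occupants of the same entry, so
// only those generation bits are genuinely new to the broadcast.
struct VirtualTag {
    uint16_t sched_entry : 6;   // "I", already implied by a matrix scheduler's broadcast
    uint16_t generation  : 2;   // the genuinely new bits
};

inline uint16_t pack(VirtualTag v) {
    return static_cast<uint16_t>((v.generation << 6) | v.sched_entry);
}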
What happens to the instruction that can’t broadcast its mapping?
For an instruction that takes a determinate number of cycles to execute, you know at issue time when it will complete, so the most obvious solution is not to issue it in the first place if the instructions already scheduled to complete in that cycle use up all the broadcast bandwidth. You could also use the rename-at-issue variant of the virtual-physical register scheme, at the expense of some added register pressure.
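As a rough sketch of the don’t-issue-if-the-slot-is-booked idea (sizes and names are mine):

#include <array>
#include <cstdint>

// Reserve completion-broadcast slots at issue time. For a fixed-latency op, the
// completion cycle is known at issue, so issue only if a slot for that future
// cycle is still free. (Assumes latency < MAX_LATENCY; this is only a sketch.)
constexpr int MAX_LATENCY     = 8;    // look-ahead window of future cycles
constexpr int BROADCAST_SLOTS = 9;    // rename/broadcast bandwidth per cycle

struct BroadcastScoreboard {
    std::array<uint8_t, MAX_LATENCY> used{};             // circular buffer indexed by cycle

    bool try_reserve(int now, int latency) {
        int idx = (now + latency) % MAX_LATENCY;
        if (used[idx] >= BROADCAST_SLOTS) return false;   // slot full: hold the op this cycle
        ++used[idx];
        return true;
    }
    void end_of_cycle(int now) { used[now % MAX_LATENCY] = 0; }   // recycle the slot that just elapsed
};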
Tracking whether a virtual register is located in the “main” register file: I don’t think one bit is enough. That bit only tells me where my logical register is not, it doesn’t tell me where it is.
All the main renamer needs to know is that the logical register isn’t within its own physical register file. If the logical register misses in the main file, the main file asks the backing file to retrieve the value. When it receives the value from the backing file, it places it into one of its own physical registers and proceeds.
The renamer for the backing file (assuming the backing file is renamed, which is implementation dependent) keeps track of the exact location of the logical register, but the backing file’s rename table doesn’t need to have the port count or performance of the main rename table, so it should be less expensive than storing the same information in the main rename table.
For the other points, I should probably go into depth on the type of ISA I’ve been mulling over. I’ll try to do that some time in the near future.