8-bit bytes
Much of this grows out of the adoption of the 8-bit byte. That became popular with the introduction of the IBM 360 family of computers in 1964. In an issue that year of the IBM Journal of Research and Development, an explanation of the choice was offered:
Character size, 6 vs 4/8: In character size, the fundamental problem is that decimal digits require 4 bits, the alphanumeric characters require 6 bits. Three obvious alternatives were considered - 6 bits for all, with 2 bits wasted on numeric data; 4 bits for digits, 8 for alphanumeric, with 2 bits wasted on alphanumeric; and 4 bits for digits, 6 for alphanumeric, which would require adoption of a 12-bit module as the minimum addressable element. The 7-bit character, which incorporated a binary recoding of decimal digit pairs, was also briefly examined.
The 4/6 approach was rejected because (a) it was desired to have the versatility and power of manipulating character streams and addressing individual characters, even in models where decimal arithmetic is not used, (b) limiting the alphabetic character to 6 bits seemed short-sighted, and (c) the engineering complexities of this approach might well cost more than the wasted bits in the character.
The straight-6 approach, used in the IBM 702-7080 and 1401-7010 families, as well as in other manufacturers' systems, had the advantages of familiar usage, existing I/O equipment, simple specification field structure, and of commensurability with a 48-bit floating-point word and a 24-bit instruction field.
The 4/8 approach, used in the IBM 650-7074 family and elsewhere, had greater coding efficiency, spare bits in the alphabetic set (allowing the set to grow), and commensurability with a 32/64-bit floating-point word and a 16-bit instruction field. Most important of these factors was coding efficiency, which arises from the fact that the use of numeric data in business records is more than twice as frequent as alphanumeric. This efficiency implies, for a given hardware investment, better use of core storage, faster tapes, and more capacious disks.
Overall, an 8-bit byte allowed a reasonably large character set, by the standards of the time, and also allowed two BCD digits per byte.
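To make the "two BCD digits per byte" point concrete, here is a minimal C sketch (the function name is made up for this example) of packing a pair of decimal digits into one byte, one digit per 4-bit nibble:

    #include <stdio.h>
    #include <stdint.h>

    /* Pack two decimal digits (0-9 each) into one byte: the first digit
       goes in the high nibble, the second in the low nibble. */
    static uint8_t pack_bcd(unsigned high, unsigned low)
    {
        return (uint8_t)(((high & 0x0F) << 4) | (low & 0x0F));
    }

    int main(void)
    {
        uint8_t b = pack_bcd(4, 2);   /* the number 42, stored in one byte */
        printf("stored 0x%02X -> digits %u and %u\n", b, b >> 4, b & 0x0F);
        return 0;
    }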
The move to byte addressing
The priority in the earliest computer designs was to process numbers as rapidly as possible. A number was typically stored in a machine word, and the desired numerical range determined the size of the word. Instructions were normally a single word, and there was often a single address as part of each instruction. The size of the address field in instructions determined the memory size. The IBM 704/709 is an example; it had a maximum of 32,768 words of 36 bits, with six characters per word, each of 6 bits. Addresses were 15 bits.
As the range of uses for computers expanded, handling text data became more and more important. Doing that in a word-addressed machine is cumbersome, at best. A byte-addressed machine allows you to access individual characters easily, but demands a larger address field. At the same time, magnetic core memory allowed building much larger memories than vacuum tubes, electrostatic storage or delay lines.
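To see why character handling is cumbersome on a word-addressed machine, here is an illustrative C sketch (not code for any real machine) that digs the n-th 6-bit character out of a 36-bit word held in a 64-bit integer. On a byte-addressed machine, the equivalent operation is just an array index.

    #include <stdio.h>
    #include <stdint.h>

    /* Word-addressed style: extract the n-th 6-bit character (0 = leftmost)
       from a 36-bit word held in the low bits of a 64-bit integer. */
    static unsigned char_from_word(uint64_t word36, int n)
    {
        int shift = (5 - n) * 6;          /* six 6-bit characters per word */
        return (unsigned)((word36 >> shift) & 0x3F);
    }

    int main(void)
    {
        /* A made-up 36-bit word holding six 6-bit character codes. */
        uint64_t w = 0;
        unsigned codes[6] = {1, 2, 3, 4, 5, 6};
        for (int i = 0; i < 6; i++)
            w = (w << 6) | codes[i];

        /* Word-addressed access needs the shift-and-mask dance above;
           on a byte-addressed machine the same thing is just text[n]. */
        printf("character 2 has code %u\n", char_from_word(w, 2));
        return 0;
    }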
These developments essentially forced computers to have larger address spaces, and ended the practice of having an address in each instruction.
Larger Data Items
It obviously makes things simpler to have a whole number of bytes per data item. Simplicity at this level is extremely worthwhile, because it's always been important to make a computer run as fast as possible within a limited budget of electronics parts (tubes early on, transistors since then). So two bytes (16 bits) becomes an obvious size.
For larger sizes, there are two factors that show up in the electronics design:
Counting things
Implementing instructions often requires counting through the bytes (or bits) of data items. Using powers of two makes the electronics of those counters simpler. To count through 4 bytes, you need a two-bit counter, which can hold values from 0 to 3. Counting through three bytes still needs a two-bit counter, but one of its values is meaningless and has to be treated as a special case in hardware.
Sending data over a serial line requires counting through the bits of each item, which is another benefit of 8-bit bytes. A 3-bit counter will handle them, without any need for special cases.
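A small C sketch of the counting point: wrapping at a power of two falls out of the counter for free, while wrapping at three needs an explicit compare-and-reset, which corresponds to extra logic in hardware.

    #include <stdio.h>

    int main(void)
    {
        /* A two-bit counter wraps from 3 back to 0 by itself: just mask. */
        unsigned c = 0;
        for (int i = 0; i < 8; i++) {
            printf("%u ", c);
            c = (c + 1) & 0x3;        /* free wrap-around at a power of two */
        }
        printf("\n");

        /* Counting through three items with the same two-bit counter needs
           an explicit compare-and-reset: the state value 3 is unused and has
           to be treated as a special case. */
        c = 0;
        for (int i = 0; i < 8; i++) {
            printf("%u ", c);
            c = (c == 2) ? 0 : c + 1;
        }
        printf("\n");
        return 0;
    }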
The IBM 360 picked 32-bit addresses (although it only allowed 24-bit memory addresses for its first decade), and once that was established, it was far easier to compete with IBM using 8-bit bytes and 32-bit addresses than if you wanted to do something different.
Memory fetches and data alignment
Fetching data from memory is simpler if data items are "aligned". This means that their addresses are a multiple of their size. So for a byte-addressed machine, like the IBM 360, a single byte can be at any address. A two-byte (16-bit) item is "aligned" if it is at an even-numbered address. A four-byte (32-bit) item is aligned if its address is a multiple of 4.
Many computer designs of the 1960s through 1990s had memories that could fetch 4 bytes in one operation, starting from an address that was a multiple of 4. If your data items are aligned, then you're guaranteed to be able to fetch any two- or four-byte item in a single read from memory. If they are not aligned, you sometimes need two fetches. That requires more complexity in the memory access system, to recognise that the operation is misaligned and generate the extra fetch. That complexity, and the extra fetch, slow things down.
Items bigger than four bytes will need two fetches, but life is simpler if your larger items are eight bytes, and aligned on 8-byte boundaries. Then you always need exactly two fetches. If you have 8-byte items that are not aligned, then you need three fetches.
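Here is a small C sketch of that arithmetic, assuming a memory that delivers aligned 4-byte units; it simply counts how many of those units an access touches:

    #include <stdio.h>
    #include <stdint.h>

    /* How many 4-byte-wide memory fetches does an access of `size` bytes
       at `addr` need, if memory delivers aligned 4-byte units? */
    static unsigned fetches_needed(uint64_t addr, unsigned size)
    {
        uint64_t first = addr / 4;                 /* first 4-byte unit touched */
        uint64_t last  = (addr + size - 1) / 4;    /* last 4-byte unit touched  */
        return (unsigned)(last - first + 1);
    }

    int main(void)
    {
        printf("aligned 4-byte load at 0x1000:    %u fetch(es)\n",
               fetches_needed(0x1000, 4));         /* 1 */
        printf("misaligned 4-byte load at 0x1002: %u fetch(es)\n",
               fetches_needed(0x1002, 4));         /* 2 */
        printf("aligned 8-byte load at 0x1008:    %u fetch(es)\n",
               fetches_needed(0x1008, 8));         /* 2 */
        printf("misaligned 8-byte load at 0x100A: %u fetch(es)\n",
               fetches_needed(0x100A, 8));         /* 3 */
        return 0;
    }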
In modern fast systems, fetches are always of complete cache lines, usually 32 or 64 bytes. These are always aligned, and aligned data items that fit inside them always arrive complete.
Quite a few computer designs regard a misaligned fetch as a program bug, and kill programs that execute one. x86-based systems don't do that, but have to pay the complexity price. They do run faster with aligned data, so that is normally used even though it is not compulsory.
24-bit systems
I've used a 24-bit system, an ICL 1900 mainframe. It used 6-bit bytes, four per 24-bit item. Those 6-bit bytes limited it to UPPERCASE text, and 24-bit pointers limited it to 16MB of RAM, which is tiny by today's standards.
A more modern 24-bit system with 8-bit bytes would still be limited to 16MB of easily addressable memory, and would be paying the costs of counters with unwanted states, and memory items that were either misaligned, or wasted a byte of memory for every 24-bit integer. A 32-bit system would be more capable, and can be built very cheaply in today's technology.
Lessons of history
There have been a couple of influential computer systems that had 32-bit integers and pointers, but used 24-bit addressing. They're the Motorola 68000 and the IBM 360. In both cases, only the lowest 24 bits of an address were used, but addresses were stored in memory in 32 bits.
As those systems were limited to 16MB of RAM, programmers stored other data in the spare 8 bits. And when 16MB of RAM clearly wasn't enough and the designs were expanded to 32-bit addressing, the data stored in those spare bits became a serious problem, because it was now treated as part of the address.
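The trick, and the trap, look something like this C sketch (illustrative only, not actual Macintosh or Amiga code): a flag hidden in the top byte of a 32-bit pointer is harmless while only 24 address bits are decoded, and becomes part of the address the moment all 32 bits are.

    #include <stdio.h>
    #include <stdint.h>

    /* On a machine that stores addresses in 32 bits but only decodes the
       low 24, the top byte is "spare", and it is tempting to hide a flag
       or type code there. */
    #define ADDR_MASK 0x00FFFFFFu

    static uint32_t tag_pointer(uint32_t addr, uint8_t tag)
    {
        return ((uint32_t)tag << 24) | (addr & ADDR_MASK);
    }

    int main(void)
    {
        uint32_t p = tag_pointer(0x00123456u, 0x80);   /* flag in the top byte */

        /* Fine while the hardware ignores bits 24-31... */
        printf("24-bit machine uses address 0x%06X\n", p & ADDR_MASK);

        /* ...but on a successor that decodes all 32 bits, the tagged value
           points somewhere else entirely, unless every such pointer is found
           and masked before use. */
        printf("32-bit machine would use address 0x%08X\n", p);
        return 0;
    }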
On the 68000 family, existing programs had to be changed to stop using those no-longer-spare bits. This was most noticeable in the wider computer industry for Macintosh software in the late 1980s, when it was being updated for 68020 compatibility, but the same thing happened on the Amiga, and presumably on other 68000-based systems.
On the successors of the IBM 360, 24-bit address programs could still be run, as could programs using larger addresses. But only 31 of the potential 32 address bits could be used; an address bit had been sacrificed to let the hardware tell the difference between the two kinds of code.
Everyone who designed a general-purpose architecture with addressing larger than 32 bits knew of those examples, and how much pain they'd caused. So let's look at the choices of address size:
40-bit addressing involves electronic and alignment complexity, and clearly wasn't going to last very long. It only allows addressing 1024GB, and as of 2022, that would already have become a problem for some markets.
48- or 56-bit addressing is about as complex as 40-bit, and while it would probably last rather longer, by the time you've gone this far, you might as well go all the way.
64-bit is simpler to build than 40-, 48- or 56-bit. It will last longer. Its register size matches standard floating-point data sizes. It seems logical.
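A quick way to check the limits behind those choices is to work out how much memory each address width can reach; this short C program prints the figures in gigabytes:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Addressable memory for each candidate address width. */
        int widths[] = {32, 40, 48, 56, 64};
        for (int i = 0; i < 5; i++) {
            int w = widths[i];
            /* 2^w bytes, expressed in GB (2^30 bytes). 2^64 would overflow
               a 64-bit integer, so work in GB from the start. */
            uint64_t gb = (uint64_t)1 << (w - 30);
            printf("%2d-bit addresses: %llu GB\n", w, (unsigned long long)gb);
        }
        return 0;
    }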
One of the first general-purpose microprocessors to go beyond 32-bit addressing was the DEC Alpha, released in 1992. The project had started in 1988, initially aiming to keep the 32-bit VAX architecture relevant in the long term. The designers rapidly realised that this was impractical, and designed a new architecture, intended to last at least 25 years. They therefore went for 64-bit addressing, to make sure that they didn't run out of address space.
Any competitor to Alpha that wasn't 64-bit would obviously have faced a marketing problem, with "why isn't it 64-bit?" questions. So 64-bit became the consensus. The much newer RISC-V architecture makes some provision for 128-bit addressing, although that variant has not yet been fully specified.
An important detail: no current 64-bit processor can actually have a full 64 bits' worth of memory connected to it. None of them have enough address lines. This does not matter: future implementations can be given more address lines. Programmers have to be discouraged from using the "spare" address bits, but that's practical to do, and operating systems can be designed to reject such usage.
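As one concrete example of how that discouragement works, current x86-64 parts with 48 meaningful virtual-address bits (an assumption about the particular width; newer parts implement more) require the top 16 bits of an address to be copies of bit 47, and reject anything else as "non-canonical". The check amounts to this C sketch:

    #include <stdio.h>
    #include <stdint.h>

    /* On an implementation with 48 meaningful virtual-address bits, the
       hardware insists that bits 48-63 are copies of bit 47; anything else
       is rejected as non-canonical. That is one concrete way spare-bit
       abuse gets policed. */
    static int is_canonical_48(uint64_t addr)
    {
        int64_t s = (int64_t)(addr << 16) >> 16;   /* sign-extend from bit 47 */
        return (uint64_t)s == addr;
    }

    int main(void)
    {
        uint64_t p   = 0x00007FFF12345678u;        /* plausible user address  */
        uint64_t bad = p | 0xFF00000000000000u;    /* a "tag" in the top byte */

        printf("plain pointer canonical?  %d\n", is_canonical_48(p));    /* 1 */
        printf("tagged pointer canonical? %d\n", is_canonical_48(bad));  /* 0 */
        return 0;
    }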