A compiler for "any" programming language can be written in "any" language.
Quite a few compilers throughout history have been written in themselves. Usually, you cannot start out with that (I'll come back to that below): You must write the very first compiler in some other language. Often, that first compiler handles only a small subset of the new language. When developing Pascal, Wirth tried to write this very first compiler in Fortran, but gave up: While you can write a compiler in Fortran, it certainly isn't a language well suited for the task. So Wirth changed to assembly for the very first small-subset-Pascal compiler.
(I know of an operating system that was written in Fortran, but most people refuse to believe that!)
Once you've got that subset-Pascal (or whatever language we are talking about) up and running, you program the next compiler version in that subset-Pascal, but now you make a more advanced compiler, maybe for the entire, un-subsetted language. Now you have a full compiler written in itself.
Most likely, that subset-Pascal was so limited that you had to program in less elegant ways to get around the limitations. So maybe you program a third Pascal compiler, but since you have now got a full-featured compiler at your hand, you can program version 3 using all the great new features of your new language.
This process of going from a first (here: programmed in assembly) compiler to the second (programmed in subset-Pascal) to a third (programmed in full-featured Pascal) is referred to as 'bootstrapping'.
I know of one case where a full-featured compiler was written in itself, using all the features of the language, and there never was another compiler involved: The language was even more primitive than K&R C, called 'NPL'. Its developer wrote the NPL compiler in NPL, so he knew very well what an NPL line would compile to. So he started at the top of the NPL compiler source code, and typed into a new file the machine instructions that the compiler should generate for the first line. And for the second line. And for the third ... Down to the last line of the NPL compiler source code. When he ran the compiler source through that program he had just been typing in, he got a new file with the same contents as what he had typed in, instruction by instruction. So the compiler worked as expected!
(That guy was slightly crazy: I was once complaining to him about a bug in the OS, which was written in NPL. He dug up the OS source code - this was in the days of hardcopy printout - and found the function I was complaining about. After some grunting and huffing, he spotted the error, and dug out a ball point pen to write a correction into the printout. Did he write the corrected statements in NPL, the language of the printout? No. Did he write it in symbolic assembly code? No. Did he write it as the octal representation of the binary instruction codes? Yes, with offsets and all as octal values!)
I know of one case where a full-featured compiler was written in itself, using all the features of the language, and there never was another compiler involved: The language was even more primitive than K&R C, called 'NPL'.
Forth is very close to that, and was designed specifically like that. The most primitive basics are (or were) written in assembler. Then some other items are added, written in those primitives. Then more are written which are based on the prior sets. And so it goes.
It's possible to do. GCC is like that, in that it takes a kind of modular approach. You can specify a platform you're compiling for and it'll use the appropriate compiler "module" to target the platform you specify.
An assembler can be written the same way, targeting a certain chipset, but even then, there are differences inside the same chip, called a "stepping" in modern parlance, where there are bug fixes in the hardware that an assembler has to take into account to generate correct code.
For example, the original 1975 production 6502 CPU had three shift/rotate instructions, ASL, LSR, and ROR. The ROR instruction was buggy and didn't work correctly. If the assembler didn't know about the bug and generated code that assumed the instruction was correct, your code would probably crash the system. The assembler had to know about the bug and generate equivalent code that didn't use the instruction.
The bug was fixed starting with the 1976 production.
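To illustrate what "equivalent code that didn't use the instruction" could look like, here is a minimal Python model. It assumes the documented semantics of ROR (rotate right through carry) and LSR (logical shift right) on 8-bit values. The function names are made up for this sketch, and it does not claim to show what any actual assembler of the time emitted.

```python
def ror(value: int, carry: int) -> tuple[int, int]:
    """Rotate right through carry: old carry enters bit 7, bit 0 becomes the new carry."""
    return ((value >> 1) | (carry << 7)) & 0xFF, value & 1


def lsr(value: int) -> tuple[int, int]:
    """Logical shift right: zero enters bit 7, bit 0 becomes the new carry."""
    return (value >> 1) & 0xFF, value & 1


def ror_without_ror(value: int, carry: int) -> tuple[int, int]:
    """Same result as ror(), built only from a shift and an OR."""
    result, new_carry = lsr(value)
    if carry:
        result |= 0x80  # put the old carry into bit 7
    return result, new_carry


# Exhaustive check over every 8-bit value and both carry states.
assert all(
    ror(v, c) == ror_without_ror(v, c)
    for v in range(256)
    for c in (0, 1)
)
print(ror_without_ror(0b00000011, 1))  # (129, 1)
```

The replacement costs an extra instruction or two per rotate, which is the kind of price you pay for working around a hardware bug in software.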
Sidetracking just a little bit, but I can't resist :
At CERN, the same thing happened with two completely independent computer families: First, CERN had bought the VAX 780, the very first VAX. When DEC announced the smaller VAX 750, they proudly announced that the 780 bug that made one instruction give rather unexpected results (sorry, I don't remember which instruction) had been fixed. CERN immediately stood up in protest: No! We have written our software to specifically compensate for that bug! If you change it in the 750, so that it gives a different, 'correct' result, we must update our software for that, and we have to maintain different program versions for the 780 and the 750. That is out of the question, and we will not buy any VAX 750.
The bug was not fixed in the VAX 750.
And the story repeats: CERN had used NORD-10 computers for process control (much due to extremely good interrupt handling: The first instruction of the handler was running 900 ns after the arrival of the interrupt signal, which was super-fast in the mid-1970s). Then comes a complete reimplementation of the architecture, labeled ND-100, with a similar happy message: Finally, we have fixed the bug that has been with the NORD-10 since its introduction ... CERN reacted in the same way: If you fix that bug, we can no longer run our software on your new machines; we have adapted to that bug! We will not buy any ND-100!
So the bug from NORD-10 was retained in the ND-100 - just like with the VAX.
Usually, a machine instruction is a word of, say, 32 bits. The first 6 to 8 bits (typically) tell what this instruction does: Move data, jump to somewhere, add two values etc. The next few bits may indicate how you go about finding the operand, i.e. what to move, where to jump or which value to add. The interpretation of the following bits depends on those (often called the 'address mode' bits): Either as a register number, how far to jump, or the value to add. If the address mode bits say so, maybe the value to add is not in the instruction itself, but the instruction tells in which register you can find the address of the value to add.
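A small Python sketch of that idea follows. The field widths (8-bit opcode, 4-bit address mode, 20-bit operand), the opcode numbers and the mode numbers are all invented for illustration; they do not match any real CPU.

```python
# Hypothetical 32-bit instruction layout (not any real CPU):
#   bits 31..24  opcode        (8 bits)
#   bits 23..20  address mode  (4 bits)
#   bits 19..0   operand       (20 bits)

OPCODES = {0x01: "MOVE", 0x02: "JUMP", 0x03: "ADD"}
MODES = {0x0: "immediate", 0x1: "register", 0x2: "register-indirect"}


def encode(opcode: int, mode: int, operand: int) -> int:
    """Pack the three fields into one 32-bit instruction word."""
    return ((opcode & 0xFF) << 24) | ((mode & 0xF) << 20) | (operand & 0xFFFFF)


def decode(word: int) -> tuple[str, str, int]:
    """Split a 32-bit instruction word back into its named fields."""
    opcode = (word >> 24) & 0xFF
    mode = (word >> 20) & 0xF
    operand = word & 0xFFFFF
    return OPCODES[opcode], MODES[mode], operand


# ADD where register 5 holds the address of the value to add.
word = encode(0x03, 0x2, 5)
print(f"{word:#010x}", decode(word))  # 0x03200005 ('ADD', 'register-indirect', 5)
```

The same 20 operand bits mean "the value itself", "a register number" or "a register holding an address" depending only on the mode field, which is the point made above.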
The compiler breaks down your code into more primitive operations, such as "add the value of X", without being concerned about what the proper machine instruction will look like. Not until it gets down to the very bottom, the 'code generator'. A compiler may provide different code generators for different machine types. A given code generator knows what an 'add' instruction looks like on x64 processors, knows the proper address mode bits and how to put the address of X into the instruction. Another code generator, for ARM, knows another code for ADD on the ARM, but it also knows that you cannot directly add something in memory. First the code generator must look up an unused register (and if there is none, it must generate an instruction to flush the value in one register back to memory to free it up), then generate an instruction to load X into the free register, and then generate an instruction to do the actual add of the newly loaded value.
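The difference between the two code generators can be sketched in a few lines of Python. Everything here is a toy: the function names, the register names, the `acc` accumulator and the mnemonics are made-up pseudo-assembly, not real x64 or ARM syntax.

```python
# Toy code generators for the primitive operation "add the value of X".
# The emitted text is illustrative pseudo-assembly only.

def gen_add_memory_operand(var: str) -> list[str]:
    """Target that can add a memory operand directly (x64-like)."""
    return [f"add acc, [{var}]"]


def gen_add_load_store(var: str, free_regs: list[str], in_use: list[str]) -> list[str]:
    """Load/store target (ARM-like): the operand must be in a register first."""
    code = []
    if not free_regs:
        # No free register: write one back to memory to free it up.
        victim = in_use.pop(0)
        code.append(f"store {victim}, [spill_{victim}]")
        free_regs.append(victim)
    reg = free_regs.pop(0)
    in_use.append(reg)
    code.append(f"load {reg}, [{var}]")    # bring X into the register
    code.append(f"add acc, acc, {reg}")    # then do the actual add
    return code


print(gen_add_memory_operand("X"))
print(gen_add_load_store("X", free_regs=[], in_use=["r1", "r2"]))
```

With no free registers, the second call emits three instructions (spill, load, add) where the first target needs one. The front end of the compiler never sees that difference.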
There is no common, standardized code for move, jump or add on different machines. The code generator knows the codes for this machine. You may tell the compiler to switch to another code generator knowing the codes for another machine (that is commonly referred to as 'cross compiling'), but the program that comes out of it cannot be run on this machine; you must copy it to another machine of the right type to run it.
In the good old days, there were dozens of different machine types out there, each with its own instruction codes. The last 35 years or so, the 'x86' has pushed most others out. Every PC in the world can understand x86 codes.
But x86 is for 32 bit machines! 64-bit PCs understand 'x64' codes (and address modes), which are different! If you want that move, jump or add to work on a 64-bit PC, your compiler must select a code generator knowing the proper codes for x64. The program will not work on an old x86 PC.
Relax ... The x64 CPUs can be told to forget the x64 codes, and run the x86 codes instead. The .exe file tells which instruction set should be activated, so you can run 35 year old programs on your brand new 64-bit PC (provided that your current OS will honor all the requests made by that old program, which is not guaranteed - but 20-25 years is probably on the safe side).
Then, what about your smartphone - will it know x64 move, jump or add instruction codes? No. Will it know x86 instruction codes? No. You have to ask your compiler to use the code generator that makes ARM instruction codes. ARM has 32 bit and 64 bit codes too, that are different ... Besides: If your program was written for Windows, it expects that it can ask the OS (i.e. Windows) for this and that service - and Android says Huh?? The services provided by Android are quite different.
Bottom line: The rosy, cosy days when x86 worked everywhere are over.
Then comes dotNet. When a dotNet 'assembly' (informally you may call it a module) is loaded into your machine, smartphone or whatever, you'll see an incompletely compiled program: The compiler has left a message: Here I should have generated the code for adding the value of X, but I didn't know what kind of machine this will be running on! So please, before running this program, generate the proper instructions for adding X, will you?
dotNet for a given machine has the proper code generator for that machine. dotNet on an ARM32 generates ARM32 codes, dotNet on an ARM64 generates ARM64 codes, dotNet on a 64 bit PC generates x64 codes. They are all different.
For now, Windows itself is not dotNet, so it must be compiled separately for every machine architecture. A growing number of applications are dotNet, incompletely compiled, and the last step of compilation, code generation, is not done until you know for sure which codes to use, just in time for execution. So the dotNet code generator is frequently referred to as the "just in time compiler", or "jitter".
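As a rough Python sketch of that last step: the program ships as portable intermediate operations, and the back end is picked only when the target machine is known. The intermediate format, the back ends and their output strings are hypothetical; real .NET intermediate code and its JIT are far more involved. Only `platform.machine()` is a real standard-library call.

```python
import platform

# Hypothetical portable intermediate code, as shipped in the "assembly".
INTERMEDIATE = [("load", "X"), ("add", "Y"), ("store", "Z")]

# One toy back end per architecture; output is illustrative pseudo-assembly.
BACKENDS = {
    "x64": lambda op, var: f"x64:{op} [{var}]",
    "arm64": lambda op, var: f"arm64:{op} [{var}]",
}

# Common platform.machine() strings mapped to a back end name.
MACHINE_ALIASES = {"x86_64": "x64", "amd64": "x64", "aarch64": "arm64", "arm64": "arm64"}


def jit(intermediate, machine):
    """Finish compilation once the target machine is known."""
    backend = BACKENDS[MACHINE_ALIASES[machine.lower()]]
    return [backend(op, var) for op, var in intermediate]


def jit_for_this_machine(intermediate):
    """Pick the back end at run time; raises KeyError on machines not listed above."""
    return jit(intermediate, platform.machine())


print(jit(INTERMEDIATE, "AMD64"))
print(jit(INTERMEDIATE, "aarch64"))
```

The same `INTERMEDIATE` list yields different output per machine, which is why the shipped file can stay identical across architectures.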
If you want to look at instruction codes and addressing modes and such to see what they are like, my recommendation is to stay away from x64 and x86. They are both a mess, having grown and been extended and grown more and been extended ... into a crow's nest. ARM64 (aka AArch64) is certainly not the simple, easily understood thing that the early 32 bit ARMs were, yet it has retained a much more manageable structure.
All you really need to understand is:
- a compiler can be written in anything (more or less) from machine code, through assembler up to most high level languages.
- The output of the compiler must be code that is compatible with the machine that will run the final executable.
- the term "machine" can be the actual hardware, a virtual machine (like the Java Virtual Machine), or a framework such as .NET.
- the actual hardware instructions do not have to be the same across all platforms, but it would be nice. Just as USB connectors keep changing so hardware platforms keep evolving.
In the beginning there was machine language, very closely tied to the CPU. It worked with numeric opcodes that programmers had to literally memorize or look up. This got old real quick.
00000000 Stop Program
00000001 Turn bulb fully on
00000010 Turn bulb fully off
00000100 Dim bulb by 10%
00001000 Brighten bulb by 10%
00010000 If bulb is fully on, skip over next instruction
00100000 If bulb is fully off, skip over next instruction
01000000 Go to start of program (address 0)
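To make the opcode table above concrete, here is a minimal Python interpreter for that imaginary bulb machine. It assumes brightness is a percentage from 0 to 100 and that "10%" means ten percentage points; the `run` function and the `max_steps` guard are additions for this sketch.

```python
def run(program, brightness=0, max_steps=1000):
    """Interpret the bulb opcodes; returns the final brightness (0 to 100)."""
    pc = 0
    for _ in range(max_steps):              # guard against endless loops
        if pc >= len(program):
            break
        op = program[pc]
        pc += 1
        if op == 0b00000000:                # stop program
            break
        elif op == 0b00000001:              # turn bulb fully on
            brightness = 100
        elif op == 0b00000010:              # turn bulb fully off
            brightness = 0
        elif op == 0b00000100:              # dim by 10
            brightness = max(0, brightness - 10)
        elif op == 0b00001000:              # brighten by 10
            brightness = min(100, brightness + 10)
        elif op == 0b00010000:              # if fully on, skip next instruction
            if brightness == 100:
                pc += 1
        elif op == 0b00100000:              # if fully off, skip next instruction
            if brightness == 0:
                pc += 1
        elif op == 0b01000000:              # go to start of program
            pc = 0
        else:
            raise ValueError(f"unknown opcode {op:08b}")
    return brightness


# Fade in: brighten, loop back until fully on, then stop.
fade_in = [0b00001000, 0b00010000, 0b01000000, 0b00000000]
print(run(fade_in))  # 100
```

Writing even this four-instruction program as raw bit patterns shows why memorizing opcodes got old quickly.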
Enter assembly. An assembler is not a compiler; it is a separate tool, usually shipped with a linker as part of a toolset. There's a difference. A compiler translates a higher-level language into machine instructions, typically many instructions per source statement. Assembly already has a one-to-one correspondence with machine instructions: it is a language that basically gives human-memorable mnemonics to the opcodes. The first assemblers were written in machine language. Assembly is very CPU-specific too. This too got old.
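A toy assembler for the bulb opcodes listed earlier shows how thin that layer is. The mnemonics (STOP, ON, OFF, DIM, BRT, SKON, SKOFF, JMP0) and the `;` comment syntax are invented for this sketch.

```python
# One mnemonic per opcode: a pure one-to-one lookup.
MNEMONICS = {
    "STOP":  0b00000000,
    "ON":    0b00000001,
    "OFF":   0b00000010,
    "DIM":   0b00000100,
    "BRT":   0b00001000,
    "SKON":  0b00010000,   # skip next instruction if fully on
    "SKOFF": 0b00100000,   # skip next instruction if fully off
    "JMP0":  0b01000000,   # go to start of program
}


def assemble(source: str) -> list[int]:
    """Translate one mnemonic per line into one opcode each."""
    program = []
    for line in source.splitlines():
        line = line.split(";", 1)[0].strip()   # drop comments and blank lines
        if line:
            program.append(MNEMONICS[line.upper()])
    return program


source = """
    BRT     ; brighten by 10
    SKON    ; fully on yet?
    JMP0    ; no: go around again
    STOP
"""
print([f"{op:08b}" for op in assemble(source)])
# ['00001000', '00010000', '01000000', '00000000']
```

There is no analysis or optimization here, only a table lookup per line. That is the essential difference from a compiler.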
There were a ton of other languages made, presumably written in ASM, but this is where a compiler kicks in. To make a really long story short, I'll just mention C's history. C was based on B and B was based on BCPL. I don't know what BCPL was written in, but the first B compiler was written in BCPL. Eventually, the B compiler was re-written in B itself and then the first C compiler was written in that version of B.
A language whose compiler is written in that same language is more common than you'd think. Anyway, these are still native compilers and eventually they still compile all the way down to machine code. Now, things like Java and C# I suspect are still written in native languages for obvious reasons, but don't be surprised if a native language's compiler is written in the same language.
Calin Negru wrote:
a standard should be required where the numbers/machine instructions for MOV are recognized everywhere. I mean it should work like a hardware resource with the same ID present on old and new processors.
This sounds great in theory, but look at how bloated and not-fun the Win32 API is: if you always have to maintain backwards compatibility, you keep things bloated when attempting to advance. I mean, it's good on one level, but it's also good to wipe out the old and try something new, like Apple is doing with the M1 chips (even though nothing is ever really new, but you get the idea).
Do we really want processors in 100 years having similar constraints as one designed in the 1960s? Rather than enforce that on the CPU, the industry has (correctly so) chosen to have compiler targets implemented instead. You use your preferred language and it compiles down to whatever the CPU expects, with optimizations, etc.
C was based on B and B was based on BCPL. I don't know what BCPL was written in, but the first B compiler was written in BCPL
Whilst I, too, don't know what BCPL was written in, I did hear why it was called BCPL. Wikipedia says it stands for "Basic Combined Programming Language" and was invented at Cambridge University (UK). The story that I heard was that there was a more complex language jointly designed by universities in Cambridge and London - that was called CPL (Cambridge Plus London). I do not know if CPL saw the light of day; but a simplified version called Basic CPL (or just BCPL) was created.
It had a bizarre construct (which was definitely a candidate for CP's Weird and Wonderful forum) to resolve the 'Dangling ELSE problem'. It was something like IF condition DO statement and TEST condition THEN statement OR statement. (See https://www.bell-labs.com/usr/dmr/www/bcpl.pdf)
I've just read the Wikipedia article (I should have done that before posting!). It says the CPL language was originally named 'Cambridge Programming Language' and later renamed to 'Combined Programming Language'. No mention of London. But the Wikipedia article on CPL (programming language) does mention the involvement of London, and that it was nicknamed 'Cambridge Plus London'. Thus, the name I heard was not its real name.
[feb 2,2023,3:01am] same as Victor, could you describe the context of that message? A veteran probably needs no further clue to guess the source of where that came from but some of us are not veterans (not me at least)
Because you're too lazy to do the work yourself, you post this nonsense just so you can down vote answers and legitimate questions? You are a troll.
From the perspective of an application, a "cancellation point" is any one of a number of POSIX function calls such as open(), and read(), and many others.