An introduction to modern computers, Part 7 – Computer programming with assembly

note: This post relies heavily on the basics explained in Part 1, Part 2, Part 3, Part 4, Part 5, and Part 6.

With an understanding of how ROM, RAM and the CPU work, we can now create programs that’ll use these resources.

As mentioned in the previous post, a CPU implements circuits that fetch, decode and execute instructions. These circuits get their input and produce output in binary form. The CPU literally “speaks” in binary. In order to “talk” to the CPU, we would either need to become fluent at representing complex ideas in the form of 1’s and 0’s, or we can come up with something more comfortable.

In Part 3 we talked about the interesting connection between the binary and hexadecimal numeral systems. If you are wondering why the decimal system doesn’t fit here, remember that 10 is not a power of 2; its roots probably lie in the ancient fact that humans have 10 fingers (which we often use for counting). Hex’s roots, on the other hand, lie in the modern need for a more comfortable way to talk to computers.

While hex is a comfortable way to speak binary, it’s still a long way from being a comfortable way to encode instructions for the CPU. Here’s an example of some 4004 CPU instructions:

D3 20 50 81

These are instructions that would make sense to a 4004, but not to a human (unless they spent a few months memorizing the 4004’s instruction set in hex form).

Programming languages are the interface through which humans can speak to computers. In order to create a programming language the first thing we need is an assembler. The assembler takes human readable instructions, also known as assembly code, and converts them to hex digits (which can be seen as a compressed form of binary) which make sense to the CPU that decodes them. How is an assembler created? Well, for the most simple example, think about a typewriter. Instead of having a “qwerty” layout, it could have a layout of legal instructions, registers and numbers. The programmer types in the assembly code, and the typewriter prints out hex digits representing the same instructions on paper.

Let’s look at the assembly code that produced the above hex digits:

LDM     $3
FIM     R0R1, $50
ADD     R1

This looks a bit more understandable. We have the symbols R0 and R1, which probably represent registers, and we have an ADD symbol which probably represents an arithmetic addition instruction, and takes the value of R1 as a parameter. Naturally, to fully understand this assembly code we should look into the MCS-4 (Micro Computer Set) manual, chapter VIII. A more readable version can be found at this site.
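At its core, an assembler is little more than a lookup table from mnemonics to opcode bits. Here’s a minimal Python sketch that assembles the three lines above; the encodings (LDM = 1101dddd, FIM = 0010rrr0 plus a data byte, ADD = 1000rrrr) follow the MCS-4 manual, but the function name and the crude parsing are invented for illustration:

```python
# A toy assembler for the three 4004 instructions used above.
# Encodings per the MCS-4 manual: LDM = 0xD0 | data, FIM = 0x20 | (pair << 1)
# followed by a data byte, ADD = 0x80 | register. Names are invented.

def assemble(lines):
    """Translate assembly lines into a list of object-code bytes."""
    out = []
    for line in lines:
        parts = line.replace(",", " ").split()
        op = parts[0]
        if op == "LDM":                    # load a 4-bit immediate into the accumulator
            out.append(0xD0 | int(parts[1].lstrip("$"), 16))
        elif op == "FIM":                  # fetch an immediate byte into a register pair
            pair = int(parts[1][1]) // 2   # "R0R1" -> pair 0, "R2R3" -> pair 1, ...
            out.append(0x20 | (pair << 1))
            out.append(int(parts[2].lstrip("$"), 16))
        elif op == "ADD":                  # add a register to the accumulator
            out.append(0x80 | int(parts[1].lstrip("R"), 16))
    return out

code = assemble(["LDM $3", "FIM R0R1, $50", "ADD R1"])
print(" ".join(f"{b:02X}" for b in code))  # -> D3 20 50 81
```

A real assembler also handles labels, expressions and error reporting, but the principle is the same: a mechanical translation from symbols to bits.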

The assembler turns assembly instructions into object code that can be decoded by the CPU, so object code and its assembly source are computer architecture specific. This means that the assembly code you wrote for the 4004 will assemble for the 4004 only. In this post we’ll be focusing on 4004 programming, but programming principles can be easily carried over to other architectures.

Let’s begin by creating a small program that adds the values 5 and 7, and stores the result in memory outside the CPU. From our understanding that CPUs and RAM chips work with registers and communicate over the data bus, we can think of this pseudo-code:

  1. Store 5 in the first register.
  2. Store 7 in the second register.
  3. Add the values of the first and second registers, and store the result in a third register.
  4. Send the address at which we want to store the result to a register in the memory chip over the data bus.
  5. Send the value of the third register over the data bus to the memory chip so it can store it.

Before we can translate this pseudo-code to 4004 assembly code, we must know what CPU resources we have at our disposal. The 4004 has seventeen 4 bit registers:

  • 16 general purpose registers named R0-RF (F as in 0xF, or 15 decimal).
  • 1 accumulator register.

And there are four 12 bit registers:

  • PC (Program Counter) register, which holds the ROM row address of the current instruction to fetch and execute from memory.
  • 3 stack registers, whose functionality will be explained later in the post.

To create a program that’ll add the values of 5 and 7 and store the result in RAM address 0x10,  we will use the following instructions:

LDM – Load data to accumulator – stores a given 4 bit value (0-0xF) to the accumulator register.
FIM – Fetch  immediate (data) from ROM – Fetches 8 bits of data from a given ROM row address and stores them into a register pair (R0R1, or R6R7 for example).
ADD – Adds the value of a designated register to the accumulator register (result is stored in accumulator).
SRC – Send register control – The 8 bit value of a register pair is sent to the RAM’s address registers during instruction cycles X2 and X3. The addressing scheme works like this: The first two bits of the address designate 1 out of 4 chips in the current bank, the next two bits designate 1 out of 4 registers in each chip, and the next 4 bits designate the offset within the register (0-0xF) to which the 4 bit data is to be written.
WRM – Write accumulator to memory – The 4 bit value of the accumulator will be sent to the RAM chip during X2 cycle. The RAM chip would then store the 4 bit value to the address set during the previous SRC instruction.
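The SRC addressing scheme described above is easy to express as a small decoder. Here’s a Python sketch (the function name is invented; the bit layout is the 2-2-4 split described above):

```python
# Decoding an 8-bit SRC address into (chip, register, offset), following the
# scheme above: 2 bits chip select, 2 bits register select, 4 bits offset.

def decode_src(addr):
    chip = (addr >> 6) & 0b11      # top two bits: 1 of 4 chips in the bank
    register = (addr >> 4) & 0b11  # next two bits: 1 of 4 registers in the chip
    offset = addr & 0b1111         # low four bits: offset 0-0xF within the register
    return chip, register, offset

print(decode_src(0x10))  # -> (0, 1, 0): chip 0, register 1, offset 0
```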

Let’s take a look at the code:

LDM     $5          ; Load the value 5 into the accumulator register
FIM     R0R1, $70   ; Load the value 7 to R0, and 0 to R1
ADD     R0          ; Add the value of R0 to the accumulator register

FIM     R0R1, $10   ; Load the value 1 to R0, and 0 to R1
                    ; $10 translates to binary 00010000b
SRC     R0R1        ; Select RAM chip 0, register 1, offset 0
WRM                 ; Store accumulator to RAM
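To convince ourselves the program does what we expect, here is a tiny Python simulation of just these five instruction steps. This is a sketch for illustration, not a real 4004 emulator (the names are invented, and only the behavior needed for this program is modeled):

```python
# Stepping through the program above: LDM, FIM, ADD, FIM, SRC, WRM.

def run():
    acc = 0
    regs = [0] * 16   # R0-RF, each 4 bits wide
    ram = {}          # RAM cells, keyed by the 8-bit SRC address

    acc = 5                              # LDM $5
    regs[0], regs[1] = 7, 0              # FIM R0R1, $70
    acc = (acc + regs[0]) & 0xF          # ADD R0 (result kept to 4 bits)
    regs[0], regs[1] = 1, 0              # FIM R0R1, $10
    src_addr = (regs[0] << 4) | regs[1]  # SRC R0R1 -> address 0x10
    ram[src_addr] = acc                  # WRM
    return ram

print(run())  # -> {16: 12}, i.e. 0xC written to RAM address 0x10
```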

It’s good practice to comment as many lines of code as possible. Assembly code usually makes perfect sense to the programmer who wrote it at the time of writing, but trying to understand someone else’s code (or your own after a month or two) without comments can be a difficult task. While it is easy to understand what an individual line of code does, understanding the combined purpose of all the lines in a program is much more difficult. It can be compared to understanding what is portrayed in a huge wall painting by looking at it through a microscope, one fraction of a millimeter at a time.

Now that we have our program’s assembly code, we can assemble it, burn the object code to a 4001 ROM chip, place it on a PCB with a 4004 CPU and a 4002 RAM chip, and run our program! Or we could copy-paste our code into an online 4004 assembler, copy the object code into the online 4004 emulator and step through the code instruction by instruction. By the time you’ve reached the end of the program at PC 08, the value 0xC will have been written to RAM address 0x10 (or RAM bank 0, chip 0, register 1, offset 0). If you are wondering how we could write to another RAM bank – that would require the use of the DCL instruction, which controls the CM-RAM pins to activate different banks. This completes the answer to the question that was left open in the previous post on how RAM chip selection works.

There are more basic features that can be implemented in order to make a program’s code more understandable, compact and effective:

Routines – Routines are individual pieces of code that have a certain functionality. A 4004 routine, for example, might perform a series of calculations based on a 32 bit value stored in registers R0-R7 and store the 32 bit output value in R8-RF. Let’s assume that this routine spans 2 ROM chips; it can now be integrated into different 4004 programs. All a program has to do is set up the 32 bit number argument in R0-R7, execute the routine’s code, and use the output value in registers R8-RF.

Code branching – In the above example, a program executed a routine whose code was sitting in 2 separate ROM chips. But what if this routine needs to be used 10 times during the course of the main program? Does it require 20 extra ROM chips? The solution to this problem is code branching. The main program’s code can be stored in ROM 0-5, and the routine’s code can be stored in ROM 6-7. Now whenever the routine’s code needs to run, the main program can “jump” to the sub-routine’s code (the main program that runs when the 4004 starts is a routine, so any routine branched to is considered a sub-routine). In the 4004, this is implemented by the JMS instruction, which is a 16 bit instruction – 4 bits for the instruction op-code, and 12 address bits representing the ROM row address of the beginning of the sub-routine to be executed.

The JMS instruction also “pushes” the address of the instruction that follows the JMS instruction onto the stack. The reason why it’s considered a “push” is that a JMS instruction stores the next instruction’s address in the level 1 stack register, while the level 1 register’s value gets pushed to the level 2 register, and the level 2 register’s value gets pushed to the level 3 stack register.

Once the sub-routine is done, it will execute the BBL instruction, that’ll “pop” the value of the level 1 register to the PC register. A JMS instruction causes a push PC->lvl1->lvl2->lvl3 (PC now holds the address pointed by the JMS instruction, level 3’s value is overwritten), and a BBL instruction causes a pop lvl3->lvl2->lvl1->PC (PC’s value is overwritten, level 3’s value is now 0). The stack is controlled by the JMS and BBL instructions, and works as a LIFO (last in, first out).
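The push/pop behavior of the three stack levels can be sketched in Python. This is a simulation for illustration only – the class and method names are invented, and the return address is taken as PC+2 since JMS occupies two 8-bit ROM rows:

```python
# Sketch of the 4004's 3-level hardware stack: JMS pushes the return address
# down through the stack registers, BBL pops it back into PC.

class Stack4004:
    def __init__(self):
        self.pc = 0
        self.lvl = [0, 0, 0]   # stack levels 1..3

    def jms(self, target):
        # push: PC -> lvl1 -> lvl2 -> lvl3 (lvl3's old value is overwritten)
        self.lvl = [self.pc + 2] + self.lvl[:2]   # JMS spans two ROM rows
        self.pc = target

    def bbl(self):
        # pop: lvl1 -> PC, lvl2 -> lvl1, lvl3 -> lvl2, lvl3 becomes 0
        self.pc = self.lvl[0]
        self.lvl = self.lvl[1:] + [0]

cpu = Stack4004()
cpu.jms(0x100)      # call a sub-routine at ROM address 0x100
cpu.jms(0x200)      # ... which calls another one
cpu.bbl()           # return: PC is back just after the second JMS
print(hex(cpu.pc))  # -> 0x102
```

Notice the consequence of a 3-level stack: nest sub-routine calls more than three deep and the oldest return address is silently lost.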

Conditional and unconditional jumps – Programs usually implement conditional jumps to different code addresses depending on user input, or the results of sub-routines. The main difference between jumping and branching is that the stack isn’t used while jumping. The 4004 implements conditional jumping with the JCN instruction, which is a 16 bit instruction – 4 bits for the opcode, 4 bits for the condition, and 8 bits for the address which will be loaded into the lower 8 bits of the PC register. The 4004’s conditional jumps are therefore limited to the current ROM chip, while the JMS instruction can jump to a full 12 bit address (i.e. to code in different ROM chips). The 4004 also features a 16 bit unconditional jump that can be used by the programmer.

You can now check out the P1 program on the 4004 emulator’s page. It fills up RAM chip 0 in bank 0 with a pattern, and also writes to the RAM chip’s status registers. The routine is basic, and features the ISZ instruction as a simple looping mechanism.

For something much more complex, you can check out the reverse-engineered Busicom 141-PF calculator’s firmware. It features some very interesting techniques to compensate for the very basic set of native instructions by using an engine that can parse “pseudo-instructions”, implemented in code at ROM address 0x4b. In my opinion the term “pseudo-instruction” might be a bit misleading, and the term “virtual instruction” might be more appropriate here, as the engine fetches op-codes that the 4004 can’t understand, and translates them to native 4004 instructions. This way, each virtual instruction can be translated to several native instructions. This technique is very interesting, and I recommend reading the descriptions in the link and getting a basic understanding of how this engine works.

I won’t be going over the details of how the 4004 gets the user’s input, because in my opinion the subject isn’t interesting enough to justify going into the excruciating detail. I might make a post in the future explaining how modern keyboards work though.

Let’s conclude this post:

  1. CPUs can execute instructions coded in binary.
  2. This binary code is called “object code”, and is usually represented in hex digits.
  3. The object code is assembled from human readable assembly code.
  4. Assembly code is CPU architecture specific, and makes use of the CPU’s resources.
  5. Assembly is the de facto lowest level computer programming language (unless you consider writing object code a programming language).

Despite the fact that this post focused on 4004 assembly, the principles discussed are relevant for programming on whatever architecture you might choose; the main differences will be syntax, supported instructions and available resources.

In the next post (which will be the last for the introduction series) we’ll take a higher-level look at modern computer architecture, and discuss the various controllers (including the interrupt controller) and how everything connects to the CPU.

Hope you found this post informative. Feel free to leave comments, and ask questions.



An introduction to modern computers, Part 6 – How the CPU works

note: This post relies heavily on the basics explained in Part 1, Part 2, Part 3, Part 4, and Part 5.

By this part of the introduction series, it’s important to get a sense of how transistors (logical gates) can be connected to create an actual processing unit. So before you continue reading, watch this video.

The CPU (Central Processing Unit) is a vast network of interconnected circuits that can carry out instructions. It’s important to understand how these “instructions” are implemented in hardware.

A CPU that can execute 3 different instructions probably has 3 different circuits, one for each instruction. An instruction doesn’t have to be complex:

This is a circuit that can execute 4 different instructions:

00b (0x0) – Light up red LED.
01b (0x1) – Light up green LED.
10b (0x2) – Light up blue LED.
11b (0x3) – Light up all LEDs.

All you have to do is supply the circuit with the instruction’s “op-code”. Some instructions can take an argument input. The above circuit could be upgraded with a few flip-flops that’ll blink the LED that is activated. So the instructions would consist of 3 bits,  two for the opcode, and one for the blink parameter.
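The decoding of this 3-bit instruction format can be sketched in a few lines of Python (a simulation for illustration; the table and names are invented, matching the LED list above):

```python
# Decoding the 3-bit instruction described above: two opcode bits select the
# LED(s), one bit carries the blink parameter.

LEDS = {0b00: ["red"], 0b01: ["green"], 0b10: ["blue"],
        0b11: ["red", "green", "blue"]}

def decode(instruction):
    opcode = (instruction >> 1) & 0b11  # upper two bits: which LED circuit
    blink = instruction & 0b1           # lowest bit: blink on/off
    return LEDS[opcode], bool(blink)

print(decode(0b101))  # -> (['blue'], True): light the blue LED, blinking
```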

These instructions can be fed to the CPU from memory. For example, let’s combine this circuit with the DRAM circuit from the previous post (click to zoom in):

Since the RAM circuit contains 2 bits in each row, it can hold 4 different instructions in memory addresses 0x0 – 0x3 (basic circuit without the blink argument).

A CPU by definition is much more complex than the above circuit. A CPU has a much bigger instruction set, with instructions that perform basic arithmetic functions (add, subtract, multiply, divide), basic logical functions (xor, or, and, not), and basic control and Input/Output (I/O) functions (store to RAM, load from RAM).

A CPU must also be able to execute these instructions without being dependent on the type, size, or location of the memory circuit. This is why CPUs have their own memory units, called registers, which are basically highly integrated SRAM memory cells. Instructions that are carried out by the CPU affect the values in the registers. A CPU might use two registers to execute an “add” instruction, the result of which can be stored in a third register, or in one of the two registers that held the values that were added. The bit size of the registers determines the native bit width of the CPU. If the CPU’s registers are 2 bits wide, the CPU speaks in 2 bit “words”. In comparison, most of the desktop processors you’ll find today speak in 64 bit words.

A CPU also contains special circuits that “fetch” and “decode” instructions. The basic instruction fetch-and-execute cycle usually looks like this:

1. Fetch the instruction from memory (and store it in a CPU register).
2. “Decode” the instruction.
3. Execute the instruction.

Let’s see how an imaginary CPU might fetch, decode, and execute instructions:

Since the instructions are sitting in a distant memory circuit, there must be an agreed upon interface by which the CPU and memory chip connect. For this example lets assume a DRAM chip is used as main memory. If an imaginary CPU would like to access the circuitjs DRAM from the previous post, it would need a connection with the Row select inputs (2 bits), the Data inputs (2 bits), Read/Write inputs (2 bits) and the data outputs (2 bits). This means that a 2 bit CPU with 8 Input/Output (I/O) pins connected to the DRAM chip could easily work with that specific DRAM chip.

The instruction opcode could then be fetched in two clock cycles (assuming the time to read a bit from DRAM is much shorter than a clock cycle). In the first cycle the data, row address, and command lines are set, and in the second cycle the values of the output are fed into the CPU’s register.

Once the instruction opcode (which also contains any parameters needed for the instruction’s execution) is fed into a register, another clock cycle is needed to set the instruction handling circuit’s inputs according to the parameters fetched from memory along with the instruction. In the next clock cycle (or cycles), the instruction circuit is activated, and the result bits can be read from the output bit-lines into a register that will hold the output value.
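The whole fetch-decode-execute loop for this imaginary 2-bit LED CPU can be sketched as follows. Everything here is invented for illustration: memory rows hold 2-bit opcodes, and each opcode’s “circuit” is just a function:

```python
# The fetch-decode-execute cycle, sketched for the imaginary 2-bit LED CPU.

memory = [0b00, 0b01, 0b10, 0b11]   # four instruction rows in "DRAM"

def execute(opcode):
    # each opcode has its own "circuit"; here, a table of outcomes
    circuits = {0b00: "red on", 0b01: "green on",
                0b10: "blue on", 0b11: "all on"}
    return circuits[opcode]

def run(memory):
    results = []
    for pc in range(len(memory)):
        instr = memory[pc]               # 1. fetch the instruction from memory
        opcode = instr & 0b11            # 2. decode it (trivial here)
        results.append(execute(opcode))  # 3. execute the matching circuit
    return results

print(run(memory))  # -> ['red on', 'green on', 'blue on', 'all on']
```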

Now let’s examine how a real CPU works – the Intel 4004 (mentioned in the previous post), which was introduced in the early ’70s. Despite the fact that it wasn’t a general purpose CPU, but rather one designed to be used in calculators, there are a few interesting things to learn from its design. Let’s examine the CPU’s packaging and pinout (click to zoom in):

The first thing that might look odd is the small number of pins used. The 4004 is a 4 bit CPU that supports over 40 instructions. In order to support more than 16 (0x10) instructions, there should be a data bus at least 5 bits wide, because instruction number 0x11 translates to binary 10001b. In order to support over 32 instructions, we need a data bus at least 6 bits wide. But the 4004 features a 4 bit data bus. It was still possible to fetch a large number of instructions along with their parameters, however this required the instructions to be broken up into pieces and sent over the data bus over several clock cycles (slower performance).
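Splitting a wide instruction over a narrow bus is easy to sketch. The Python below sends an 8-bit instruction over a 4-bit bus in two cycles and reassembles it on the other side; the function names are invented, and the high-nibble-first order is an assumption for illustration (the actual order is a detail of the bus protocol):

```python
# Sending an 8-bit instruction over a 4-bit bus in two clock cycles.

def send_over_bus(byte):
    """Yield 4-bit chunks, high nibble first (an assumed ordering)."""
    yield (byte >> 4) & 0xF   # cycle 1: high nibble
    yield byte & 0xF          # cycle 2: low nibble

def receive(nibbles):
    """Reassemble the byte on the receiving side."""
    high, low = nibbles
    return (high << 4) | low

instr = 0xD3                                   # LDM $3
print(receive(send_over_bus(instr)) == instr)  # -> True
```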

If you’re wondering what the reason behind the narrow data bus is, the answer is “obscure reasons”. This answer was taken from an interview with one of the 4004’s designers. Often compromises in design are made because of financial considerations. A 4 bit data bus and 4 bit registers mean fewer transistors (less money spent) and more clock cycles to compensate (lower performance). It’s all a matter of price-performance ratio, and performance might not be an issue when the 4004 is used to power a simple calculator.

So we have the CPU packaged in the 4004 chip, but where does it fetch instructions from? According to the MCS-4 (Micro Computer Set) manual, page 3 (just at the bottom of the page), the minimum system is a CPU and a ROM chip that holds the CPU’s instructions. Using the MCS-4 datasheet as a reference, let’s look at the way the 4004 connects to the 4001 ROM (which holds instructions) and the 4002 RAM (a DRAM chip which serves as main RAM for the CPU) [click to zoom in]:

You can see that the 4004 communicates with the 4002 and 4001 via the 4 bit bus, to which all the chips are connected in parallel. This means that all the chips can sample the data on each bit-line at all times, and all chips can drive the data bit-lines high or low at will. Naturally a protocol is implemented to make sure all chips communicate over the data bus in an organized fashion. All a chip has to do in order to “send” bits over the data bus is pull the voltage high or low in sync with the clock shared by all chips on the bus. Usually chips sample the bit-line during a “rising edge” of the clock (remember the flip-flops from Part 4).

A natural question at this point would be “how does the CPU communicate with specific chips?” – First of all, the 4004 features several CM (Command Control) pins which can be used for chip activation and selection. The single CM-ROM pin, which is connected to all 4001 chips, is always active, so the ROM chips are always standing by to receive commands from the CPU. This means that ROM chip selection is implemented in a different manner. In order to understand how ROM chip selection works, we need to understand the 4004’s basic operation.

Since the 4004 is a CPU, its basic operation is fetching and executing instructions (which are held in the ROM). This is described on pages 5 and 6 of the MCS-4 manual:

The 4004 has a special register named the PC (Program Counter) register, which holds the address of the instruction that needs to be fetched and executed from ROM. This register is special since it is a 12 bit register, meaning that it can hold a maximum value of 0xFFF (4,095 decimal). Since the 4001 holds 2,048 bits of instruction data, this seems like too little at first, because the PC register can’t even max out 2 4001 chips, despite Intel’s advertisement of 16 ROM chip support. However, by looking closely at the datasheet and manual, one can see that each basic 4004 instruction is 8 bits wide, and each row in the 4001 chip contains an 8 bit word. The PC register holds a row index, not a bit index, and it can point to 4,096 different rows which hold a total of 32,768 bits worth of instructions (exactly 16 4001 chips).
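The arithmetic above can be checked in a couple of lines:

```python
# A 12-bit PC indexes 4,096 rows of 8 bits each, which is exactly
# sixteen 2,048-bit 4001 ROM chips.

rows = 2 ** 12         # 12-bit Program Counter -> 4,096 row addresses
bits_total = rows * 8  # each row holds one 8-bit word
bits_per_4001 = 2048   # capacity of a single 4001 chip
print(bits_total, bits_total // bits_per_4001)  # -> 32768 16
```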

When the 4004 is powered up (or reset), the PC register’s value is 0, so the CPU begins the instruction fetch, which is the first part of the “instruction cycle” (click to zoom in):

At the beginning of an instruction cycle, the sync line is pulsed to sync all ROM chips that are waiting for commands. During each cycle after the sync pulse, 4 bits of data are “sent” through the data bus in parallel and read by all ROM chips.

Cycles A1 and A2 carry the 8 bit address of the 8 bit instruction within the ROM chip, and A3 is the 4 bit chip select code (it is a bit strange that chip selection happens at the end and not the beginning). 4 bits translate to a maximum value of 0xF (15), meaning that ROM chips numbered 0-15 can be addressed (this answers the ROM chip select question). Each ROM chip’s number is hard wired during the programming and manufacturing process, so when the CPU asks for data from ROM chip number 3 on the bus, there is no confusion as to which chip was selected.
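In other words, the 12-bit PC value is sent as three 4-bit nibbles. Here’s a Python sketch of that split (the function name is invented, and taking A1 as the low nibble of the row address is an assumption for illustration; the exact ordering is a bus-protocol detail):

```python
# Splitting a 12-bit ROM address into the three nibbles sent during cycles
# A1, A2 (row address within the chip) and A3 (chip select).

def address_cycles(pc):
    a1 = pc & 0xF          # low nibble of the 8-bit row address
    a2 = (pc >> 4) & 0xF   # high nibble of the 8-bit row address
    a3 = (pc >> 8) & 0xF   # chip select: which of the 16 ROM chips
    return a1, a2, a3

# Row 0x23 of ROM chip number 3 (12-bit address 0x323):
print(address_cycles(0x323))  # -> (3, 2, 3)
```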

The 8 bit wide instruction is then sent over the data bus by the ROM chip to the CPU’s registers over the two cycles M1 and M2, and it is stored in an 8 bit register for “decoding” (which is a fancy way of saying it will be fed into a decoder that’ll activate the relevant circuit in the CPU, and feed it the instruction’s parameters as inputs). This is followed by the X1 cycle, in which the CPU processes the instruction internally, and the X2 and X3 cycles, in which the CPU might communicate with the RAM or ROM chips in order to prepare for a read or write (write only for RAM) instruction.

This sums up the 8 cycles needed for the 4004 to fetch and execute a basic instruction in all of its narrow-bandwidth-data-bus-hack glory. Keep in mind that some instructions are 16 bits wide, and they require more cycles to fetch and decode. Take a look at the 4004’s instruction set. The JUN (Jump UNconditional) instruction for example (unconditionally begin executing instructions from a given address) is 16 bits wide – 4 bits for the instruction “op-code”, and 12 bits for the target address to be fed into the PC register in order to begin execution.

Chip selection for the 4002 chips is a bit different. First of all there are 4 CM pins for RAM (CM-RAM0 to 3), which activate different RAM banks (each bank contains 4 DRAM chips). The rest will be explained in detail in the next post.

If you’re wondering what the process of programming a ROM chip and running your program on the 4004 looked like, the simple answer would be – not like anything we are used to today. Back in the ’70s you had to write all your programs in human readable “assembly” language, which could then be fed into another computer program that read the lines of code and translated them to binary (usually represented to humans in hex digits). This binary code was then burned into the 4001 ROM chips by Intel, and sent to the client. One client of the MCS-4 was Busicom, a Japanese company that used the MCS-4 in its Busicom 141-PF calculator. Here’s a photo of the 141’s main board:

Take a look at this photo:

You can actually see the 4 bit data bus embedded in the PCB (follow the dark lines), connecting all 4002 and 4001 chips to the 4004!

Another important thing to note is the way a 4 bit processor handles numbers bigger than 0xF. It does so by using shift registers. The idea is that the CPU carries out an addition of two big numbers by calculating 2 digits at a time using an adder circuit (similar to the one used in Part 3), and gradually pushes the result into a shift register chip to form a bigger and bigger number. The 4003 is the MCS-4’s shift register chip.
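The underlying idea – adding big numbers a few bits at a time and carrying between digits – can be sketched in Python. This is an illustration of the principle, not the 4004’s actual code; the function name and digit-list representation are invented:

```python
# Adding numbers bigger than 0xF on a 4-bit machine: one 4-bit digit at a
# time through the adder, with a carry rippling into the next digit.

def add_multinibble(a, b, width=4):
    """Add two numbers given as lists of 4-bit digits, least significant first."""
    result, carry = [], 0
    for i in range(width):
        s = a[i] + b[i] + carry  # one pass through the 4-bit adder
        result.append(s & 0xF)   # keep the low 4 bits as this digit
        carry = s >> 4           # carry into the next digit
    return result

# 0x01F2 + 0x00E3 = 0x02D5, as little-endian nibble lists:
a = [0x2, 0xF, 0x1, 0x0]
b = [0x3, 0xE, 0x0, 0x0]
print(add_multinibble(a, b))  # -> [5, 13, 2, 0], i.e. 0x02D5
```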

To conclude this post, lets go over some important points:

  1. The CPU contains circuits that can handle different instructions.
  2. The CPU contains a special decoding circuit that decides which instruction circuit should be activated depending on the bits in (i.e. the value of) the register that holds the fetched instruction. It also feeds the decoded input parameters to the instruction circuit.
  3. The instruction fetch circuit sends a series of bits over the data bus; this series of bits is actually a command for a specific ROM chip to send data from a specific address back to the CPU.
  4. When an instruction is fetched and executed, the Program Counter register is incremented to point to the address of the next instruction to be fetched and executed.
  5. The communication over the data bus is synchronized by the clock the CPU generates over the clk phase 1 and 2 pins.

In the next post, we’ll create a simple assembly program for the 4004, and further discuss the use of registers and memory in computer programs.

Hope you found this post informative. Feel free to leave comments, and ask questions.



An introduction to modern computers, Part 5 – Memory and how DRAM works

note: This post relies heavily on the basics explained in Part 1, Part 2, Part 3 and Part 4.

In the previous posts we did not discuss performance issues at all, despite performance being a major factor in computing. So let’s talk about performance. We use computers as tools to assist us in doing whatever needs to be done, and we expect them to do it fast. Would you find it acceptable if it took a hand held calculator one whole minute to calculate and display the result of a simple multiplication operation?

So what can cause a performance bottleneck in a well designed logical circuit? The current flows (the propagation of the electromagnetic wave mentioned in Part 1) through the wires at nearly the speed of light, so it seems like the only thing that stands between us and a really fast computer is a super fast clock generator to drive our switching logic. One problem though – physics.

Since transistors control the flow of electrons by saturating a “gate” with electrons (or draining electrons from said gate), there’s a delay in flow control (a delay in how fast a transistor can turn “on” and “off”). This means that if you keep cranking the clock speed up, eventually one of the transistors in your circuit won’t get enough time to saturate and allow current to pass through it. This means that somewhere in your logic, instead of getting a 1 you get a 0. This is how a “bug” is born. A possible fix for this bug is to crank up the voltage as you increase the clock speed (this allows faster saturation of electrons in transistors) – however this added voltage can destroy the fragile components making up your circuit. Over the years, materials and fabrication techniques improved, allowing for faster switching transistors, which in turn allowed an increase in clock speed.

So we got the speed issues with the switching logic covered. But what about the memory component? If you want a routine or a program to run, you need to use memory. A CPU without memory is just a bunch (a big bunch though) of switches flicking on and off following the ticks and the tocks provided by the clock. Could memory be a bottleneck in our system?

To answer that question, we need to make some things clear first. In this part of the post the term “memory” is used to describe a technology that stores bits (1’s or 0’s). An example of such a technology is the hard disk, on which large amounts of bits (data) are usually stored. During the mid 1970’s the average seek time for a hard disk was 25ms, and that’s not including the overhead of reading/writing the actual data, and moving it all the way back to the CPU.

Let’s assume that with all overheads combined, reading a randomly located bit from the disk would take 30ms on average. Now let’s imagine a system in which an Intel 4004 (clocked at 740kHz), which could execute 46,300 to 92,600 instructions per second (the meaning of an instruction will be discussed in the next post), is paired with a mid ’70s hard disk (which is used as the CPU’s main memory). In this scenario, each random access to memory (needed for a single bit read/write) would cause the CPU to stall (while waiting for the data to be read or written) for a period of time in which it could have executed well over 1,000 more instructions!
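The stall arithmetic is worth checking: at the 4004’s quoted instruction rates, a 30ms disk access costs roughly 1,400 to 2,800 instructions, depending on the instruction mix:

```python
# Instructions the 4004 could have executed during one 30 ms disk access.

access_time = 0.030                   # seconds per random disk access
for rate in (46_300, 92_600):         # instructions per second (4004 range)
    print(round(rate * access_time))  # instructions lost per access
# -> 1389 and 2778
```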

Notice that the calculations above were made based on random access to memory. When we are talking about memory which will be used by a CPU, we are talking about RAM (Random Access Memory), since the CPU should have the ability to access memory in a non-sequential pattern.

So how do we deal with this memory bottleneck? A natural solution would involve discarding all mechanical components. Over the years, several interesting memory technologies appeared (and some disappeared). One example is magnetic-core memory:

This technology however was expensive, and had implications for the amount of storage available due to the size of the toroids (each toroid holds a single bit, so you can actually see how much RAM is installed in your system).

SRAM (Static RAM) was introduced in the late 1960’s, and was based on a combination of transistors that would create a circuit (similar to a latch) that can “hold” a single bit:

I’ve created a TXT file which can be imported to this site to see how a single SRAM cell functions and play around with it.

SRAM is great. The only problem is that SRAM is expensive. The previously mentioned Intel 4004 was built from approximately 2300 transistors, meaning that an SRAM storage of just 1000 bits (just over 100 bytes) would require almost 3 times more transistors than the CPU.

DRAM (Dynamic RAM) was introduced during the late 1960’s to deal with the cost issues. Nowadays, DRAM is the most common type of RAM, and you can find it just about anywhere. The commonly used DRAM cell is created by a combination of a transistor and a capacitor:

Here’s a TXT file for it.

You’ve probably noticed all the extra components around the SRAM and DRAM cell. They are necessary for writing and reading the bit from the cell.

If you’re wondering “why use the more expensive and complex SRAM when DRAM works just fine?” – and that’s a great question – the answer is performance. In order to change the state of a DRAM cell, a capacitor must be charged or discharged. This takes time. Changing an SRAM cell’s state is just a matter of activating a few transistors.

Another reason for the performance gap is the fact that the read cycle in DRAM is destructive, and the cell must be rewritten after it is read (this will be explained later in the post). This read overhead doesn’t exist in SRAM.

CPU cache uses SRAM memory cells, which are fast but expensive.  So expensive in fact, that the amount of bits you can store in cache is extremely small compared to the main memory (Modern CPU’s L1 cache is several KBs compared to GBs of main memory). The main memory modules (installed in your motherboard) are DRAM based and significantly slower.

Since DRAM is the most common RAM in use by CPUs today, let’s go in depth on how a DRAM circuit actually works.

One important thing to keep in mind while dealing with DRAM is that DRAM leaks. Because there is current leakage between the transistor’s gate, source and drain, the capacitor may discharge, or gain charge from neighboring cells. SRAM doesn’t have this problem because it doesn’t use a capacitor to store the bit. You can see this with the DRAM TXT file above: over time the capacitor loses its charge, and when the voltage drops below 2.5v, the op-amp (which determines whether the capacitor holds a 0 or a 1 by comparing the voltages on the two lines) will sense a 0 where there was previously a 1.
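The effect of leakage on the sensed bit can be sketched as a toy model. This is not a circuit simulation: the exponential decay and the 50ms time constant are illustrative assumptions, and real leakage rates depend on the process and temperature.

```python
import math

# Toy model of a leaking DRAM cell storing a 1.
# Assumptions: simple exponential discharge with an arbitrary time constant.
FULL_CHARGE_V = 5.0    # voltage of a freshly written 1
THRESHOLD_V = 2.5      # sense amplifier reference voltage
TAU_MS = 50.0          # hypothetical discharge time constant

def cell_voltage(t_ms, v0=FULL_CHARGE_V):
    """Capacitor voltage t_ms milliseconds after writing a 1."""
    return v0 * math.exp(-t_ms / TAU_MS)

def sensed_bit(t_ms):
    """What the sense amplifier would report at time t_ms."""
    return 1 if cell_voltage(t_ms) > THRESHOLD_V else 0

print(sensed_bit(10))   # shortly after the write: still reads 1
print(sensed_bit(100))  # much later: the leaked cell now reads 0
```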

The solution to DRAM leakage is an independent refresh circuit that periodically refreshes all cells. This requires the cells to be read (destructively) and rewritten. During the refresh cycle the DRAM cells can’t be used, which widens the performance gap between DRAM and SRAM even further (even though the refresh cycle takes only about 1% of the operating time of modern DRAM chips).
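The refresh idea boils down to "read every cell, then write back what you sensed". Here's a minimal sketch of that policy; the class and method names are my own, and the model ignores timing entirely:

```python
# Toy model of a DRAM refresh: a read destroys the stored charge,
# so a refresh is a destructive read followed by a rewrite.
class DramCell:
    def __init__(self, bit=0):
        self.charged = bool(bit)

    def destructive_read(self):
        bit = 1 if self.charged else 0   # the sense amplifier decides 0 or 1
        self.charged = False             # charge sharing destroys the state
        return bit

    def write(self, bit):
        self.charged = bool(bit)

def refresh(cells):
    # While this loop runs, the array can't service normal reads/writes.
    for cell in cells:
        cell.write(cell.destructive_read())   # read, then rewrite

row = [DramCell(1), DramCell(0), DramCell(1)]
refresh(row)
print([c.destructive_read() for c in row])   # the data survived: [1, 0, 1]
```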

Let’s look at an example of a DRAM array. First of all, there’s the sample DRAM circuit in circuitjs (under the Circuits tab -> Sequential Logic -> Dynamic RAM):

This circuit is a good example for understanding the basic usage of RAM in general. Bits are stored in rows (also called “word-lines”) and columns (also called “bit-lines”). In this case, there’s a single column connecting 4 rows, so the output data will always be 1 bit wide. Each column in a DRAM array has its own charge sensing unit (or “sense amplifier”). Basically, the row select in the above picture is the address of the data we want to read/write.
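The row-select addressing described here can be sketched as a toy model: a hypothetical one-column, four-row array like the circuitjs sample, where the row number acts as the address and the output is always 1 bit wide.

```python
# Toy model of a 1-column, 4-row RAM array.
# The row index plays the role of the row-select (address) lines.
class RamArray:
    def __init__(self, rows=4):
        self.cells = [0] * rows   # one bit per row, single column

    def write(self, row, bit):
        self.cells[row] = bit     # row select picks the active word-line

    def read(self, row):
        return self.cells[row]    # one column, so output is 1 bit wide

ram = RamArray()
ram.write(2, 1)
print(ram.read(2))  # 1
print(ram.read(0))  # 0 (untouched rows stay at their initial value)
```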

The operation of a real DRAM cell is a bit different from the circuitjs version. While the circuitjs version uses an op-amp for a direct voltage comparison between the capacitor and the +2.5v line, a real DRAM circuit makes use of a physical attribute of the bit-line. The bit-line (column) is a relatively long piece of conductive material that can be charged with electrons. A bit-line can actually hold around 10 times the charge of a single DRAM cell’s capacitor (whose capacitance is just a few femtofarads). Using this physical attribute, the read cycle is performed in the following steps (let’s assume for this example that a fully charged capacitor is at 5v):

Pre-charge phase – The bit-line is pre-charged to half the voltage of a fully charged capacitor (if a charged capacitor is 5v, the bit-line will be pre-charged to 2.5v).

Row activation phase – When the row is “activated”, the transistor allows charge to flow between the capacitor and the bit-line. If the capacitor was empty, some charge will flow into it from the bit-line, and the bit-line’s voltage will fall below 2.5v. If the capacitor was charged, some charge will flow from it to the bit-line, and the bit-line’s voltage will rise above 2.5v. During this phase the voltage of a fully charged capacitor will probably drop below 2.5v, and an empty capacitor will probably be charged above 2.5v. For this reason, the read cycle is considered destructive – the former state of the capacitor (representing the stored bit) is lost.

Sensing phase – The voltage on the bit-line is compared with the voltage of another pre-charged bit-line to figure out if the cell’s value is 1 or 0. In order for this to work the DRAM array is organized in pairs:

You can see that when a word-line is activated, each cell is connected to a bit-line on its left, while on its right there’s a bit-line that will not be connected to a capacitor. The result of the comparison is stored in a buffer. Since the entire row is activated, this buffer is called the “row buffer”.

Recharge phase – The cells are recharged according to their sensed values.
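The four phases above can be sketched numerically. The ~10x bit-line-to-cell capacitance ratio comes from the text; the absolute values are illustrative, and the sensing is reduced to a simple comparison against the pre-charge voltage:

```python
# Numeric sketch of the DRAM charge-sharing read cycle.
# Assumption: the bit-line holds ~10x the capacitance of a cell capacitor.
V_FULL = 5.0       # voltage of a fully charged cell capacitor
C_CELL = 1.0       # cell capacitance (relative units)
C_BITLINE = 10.0   # bit-line capacitance (relative units)

def read_cycle(cell_v):
    # 1. Pre-charge: the bit-line sits at half the full voltage (2.5v).
    bitline_v = V_FULL / 2
    # 2. Row activation: charge sharing between the cell and the bit-line.
    #    The cell ends up at this shared voltage too, so its state is lost.
    shared_v = (cell_v * C_CELL + bitline_v * C_BITLINE) / (C_CELL + C_BITLINE)
    # 3. Sensing: compare against a reference bit-line still at 2.5v.
    bit = 1 if shared_v > V_FULL / 2 else 0
    # 4. Recharge: rewrite the sensed value back into the cell.
    restored_cell_v = V_FULL if bit else 0.0
    return bit, shared_v, restored_cell_v

print(read_cycle(5.0))  # stored 1: the bit-line rises slightly above 2.5v
print(read_cycle(0.0))  # stored 0: the bit-line dips slightly below 2.5v
```

Note how small the swing is: with a 10:1 capacitance ratio the bit-line moves only about 0.23v away from 2.5v, which is why each column needs its own sense amplifier.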

While the circuitjs sample is nice for getting an impression of how DRAM works, it lacks an independent refresh circuit, doesn’t have separate controls for reading and writing, doesn’t have a row buffer, and so on…

I’ve created a two-column, four-row DRAM array (8 cells in total) which contains the above mentioned features. Get the TXT file and check it out. Let’s go over the main components:

First of all, the refresh circuit is automated and performs a refresh every 40ms. While a refresh is in progress, the read and write controls are disconnected. The capacitor array has now doubled in size, and each read and write operation affects both cells in the selected row. The results of reads are now stored in a row buffer.

While this configuration looks over-the-top for just 8 memory cells, keep in mind that in order to increase the amount of memory in the circuit, it is possible to add more columns (along with their sense amplifiers, write drivers and buffers) without any need to change (i.e. add transistors to) the DRAM control, row select and refresh circuits.

Much more information on how memory works can be found in this great book.

Here’s a real world implementation of the above, as seen in this diagram of the internals of the Intel 4002 (the 4004’s RAM chip, which holds 80*4 bits of data):

Notice the sensing and row buffer components for each column.

Another type of memory that should be mentioned is ROM (Read Only Memory). Besides containing the grid of rows and columns which hold the bits (by being hard-wired to ground or a voltage source), a ROM chip should hold a control circuit that allows fetching the data. There are several ways to implement a PROM (Programmable ROM); one of them involves UV light: an EPROM is programmed electrically, and erased by exposing it to UV light. A manufacturer could program such a chip and encase it in opaque epoxy, thus creating a one time programmable ROM (unless someone goes through the trouble of decapping the chip and erasing it with UV). The Intel 4001 is the ROM counterpart of the 4002.

By this point we’ve covered the basics of how memory (both ROM and RAM) works. In the next post we’ll talk about how a CPU works.

Hope you found this post informative. Feel free to leave comments, and ask questions.

