Maarten Pennings'
NDS(L) - Nintendo dual screen (lite)

Table of contents

1. The hardware
    1.1. Falling for Nintendo hardware (or Nintendo marketing?)
    1.2. Finding hardware for homebrew
    1.3. Finding a shop for homebrew hardware
    1.4. Installing firmware for homebrew hardware
    1.5. Architecture of an NDS with M3DS simply
    1.6. Running software from the M3DS
2. Starting with homebrew
    2.1. Finding homebrew on the web
    2.2. Getting devkitPro
    2.3. Hello world!
    2.4. Makefiles
3. Writing my first application: the watch
    3.1. Watch 1 - The watch that comes with devkitPro
    3.2. Watch 2 - The watch 1/3 from devkitPro, 2/3 Chris Double
    3.3. Watch 3 - The watch in modern devkitPro style
4. Adding sound to the watch
    4.1. Converting sound so that the NDS can play it
    4.2. Adding sound fragments to the makefile
    4.3. Watch 4 - Writing the code to play sounds
    4.4. Sound theory (in four parts)
    4.5. Multiple sounds at the same time and to fixed channels
    4.6. Synchronizing sound commands between Arm9 and Arm7
    4.7. Watch 5 - Talking time
5. Adding keys to the watch
    5.1. Reinstall devkitPro
    5.2. Analysing the new devkitPro clock code
    5.3. The theory behind keys
    5.4. The touch screen
6. Video architecture
    6.1. Video cores
    6.2. Video "layers"
    6.3. Video modes
    6.4. Video bases
    6.5. Video banks
    6.6. Video demo 1: multiple backgrounds
    6.7. Video demo 2: both screens
    6.8. Video demo 3: snake
7. Scratch area
    7.1. Possible watch improvements
    7.2. Open Issues

Table of downloads

Skin Watch1 Watch2 Watch3 Sounds Watch4 Watch5 Watch6 Snake BG demo LCD x2 demo

1. The hardware

1.1. Falling for Nintendo hardware (or Nintendo marketing?)

Those folks at Nintendo do a good job. I never play computer games. I did, some 30 years ago. On a Commodore 64. I own several PCs, I play with them a lot. But not games. But now I bought the Nintendo Dual Screen (lite), NDS for short. Because its selling points sound so cute. It sells enormously (85 million in 4 years), much better than the PSP (which looks sexier if you ask me).

During the week of April 18, the Japanese market saw the release of two newly colored DS units, as well as the release of the pet simulator Nintendogs. That week, 96,191 Nintendo DS units were sold, compared to the PSP's 33,004 units. In fact, the DS sold more units in that week than the PSP, Xbox, PS2, GameCube, Game Boy Advance, and GBA SP combined.
from 1up

Of course I started gaming. See my list of favorite games for what a non-gamer likes.

And then, I learned that you could write your own stuff for the NDS. It's known as homebrew. You do need some extra hardware for that. It used to be complicated, but these days you can buy an NDS card that holds an SD card on which you can load software (using a USB adapter in a PC).

Let's see if we can do that (thanks to mr_hanky for helping me with the first steps).

1.2. Finding hardware for homebrew

A nice card seems to be the 'M3DS simply', about €35. It hosts microSD cards, which may hold several executables. It runs homebrew, but also plays video, MP3, txt files and, of course, pirated .nds roms.

It seems that the 'R4DS' is from another company. The R4 company licenses the R4 to the M3 company, which sells it under the name 'M3DS simply' [update nov 2007: the M3 company has made its own card now, it's called 'M3DS Real']. The R4 company is one step ahead in firmware (although usually in the Chinese version). The price is the same as for the 'M3DS simply'. It seems (at least for earlier versions) to be possible to recat an 'R4DS' Chinese to an 'R4DS' English, but also to recat an 'R4DS' to an 'M3DS simply'.

The M3DS simply and R4 for that matter are so called 'slot 1 solutions', where the homebrew card goes into the native NDS slot (slot 1). This is state of the art. Previous solutions were 'slot 2 solutions' which go in slot 2, the GBA slot of the NDS. For slot 2 solutions, one usually needs an extra card in slot 1 (passcard).
If you don't know all this, the market is quite confusing. The M3 company has, next to the 'M3DS simply', also a plain 'M3' (sometimes also called 'M3 adapter'). This is their (old) slot 2 solution. They also have an M3 known as 'M3 lite', which is actually an 'M3' with a form factor fitting into an NDS lite (as opposed to a 1st generation NDS). The 'M3' and 'M3 lite' come in different flash styles, i.e. using different flash card technologies (compact flash, microSD etc), giving a suffix to the name ('M3 Adapter CF/SD Version').
Still with me? Let me try to shake you off. The M3 comes in two variants: 'M3 Perfect' and 'M3 Pro'. The perfect version is about €55, the pro version is only €30. Both play NDS and GBA (but the pro can't play GBA games over 32M) and both play video and MP3 and homebrew.
Often, the M3 is sold together with a PassCard3 (€20). The reason is that running NDS homebrew using a SLOT-2 storage device requires a booting tool, which instructs the NDS to run code from the GBA slot. A booting tool is not required for SLOT-1 devices, nor is a booting tool required to use GBA homebrew on the DS. The 'M3DS simply' also functions as pass card for the 'M3 adapter', it replaces the PassCard3. I'm not sure if the R4DS does the same.

A good comparison of the M3 simply and R4DS is written by scorpei.

1.3. Finding a shop for homebrew hardware

I'm surprised by the price differences for the M3DS simply. Prices are in US dollars, as of 17 June 2007.

Store               M3DS simply     R4DS            Remarks
winsunx (chinese)   $25.99          $25.99          This shop vanished
winsunx (english)   -               $28.99          This shop vanished
flashlinker-shop    ($73) €55.00    ($73) €55.00

I ordered four M3DS simply's, each costing $43.90 in Canadian dollars, plus $30.21 CAD shipping. As a non-Canadian resident, Kick-gaming required me to use PayPal, which, fortunately, is possible without having an actual PayPal account (as long as you have the good ol' credit card). Naturally, PayPal adds an extra 2.5%. All in all, I had to pay €150 for four, which is €37.50 per M3DS simply. Don't tell anybody, but that's cheaper than a hot NDS game.

I ordered Thursday night (CET), and the package arrived Wednesday, in good order!

1.4. Installing firmware for homebrew hardware

I bought a "SanDisk Ultra II microSD 2.0GB card" at MediaMarkt for €35. This is way cheaper than in the US or Canada, that's why I didn't order it together with the M3 Simply. Later I bought "slower" SD cards, they also seem to work.

In the box of the M3 simply, there is a mini-CD. But everybody warns that it is probably outdated. And, in my case it was. My CD has version V1.03 (2007-3-12), the website already has v1.06. Similarly, the mini-CD featured moonshell 1.5, the website has 1.6. Note however, that the m3adapter website is not so professional, so I downloaded the latest firmware from nbrew.

How to install? I began by inserting the M3 simply in the NDS without the flash card. After starting, the upper screen displays "Couldn't find the SD/TF card". So, I inserted the M3 simply with flash card in the NDS and switched it on. The upper screen says "Couldn't find _DS_MENU.DAT". These two to-the-point error messages give confidence in the M3 product.

This probably also means the (brand new) flash card is pre-formatted. I'm wondering whether this is FAT32 or FAT16, the latter is generally recommended.

The manual is rather unclear on the "install", i.e. on which files to put where on the flash card. My understanding is that you need two files and one folder, all in the root of the flash card. The files are "_DS_MENU.DAT", which is the firmware (some call it the 'menu software' or 'loader', the manual calls it 'game kernel', I call it the firmware), and "_DS_MSHL.NDS", which is moonshell (the manual calls it 'media kernel', I call it the media player). The folder (and its contents) is needed for moonshell to work; it seems to be called 'moonshl' (but 'shell' seems also ok). It seems optional to copy a second directory to the root: '_system_'. It holds, amongst others, a file with cheat codes (see nbrew for instructions on how to use them).

So probably this is best summarized into copy 'E:\system v1.03\english' (from the mini-CD) to 'F:\' (the flash card).

+- _DS_MENU.DAT     the core file - firmware 
+- _ds_mshl.nds     this is the media player (probably can be left out) 
+- _system_\        contains firmware support files (maybe can be left out) 
|  +- CHEAT.DAT     some games can be patched upon loading to cheat 
|  +- gbaframe.bmp  the firmware supports multi-skins, this is a part of one skin 
|  +- ebook\        ?? fonts for the ebook reader in firmware 
|     +- ...        3 files not shown here 
+- moonshl\         contains media player support files 
   +- bookmrk0.sav
   +- bookmrk1.sav
   +- bookmrk2.sav
   +- bookmrk3.sav
   +- lang0.ini
   +- lang1.ini
   +- moonshl.ini
   +- moonshl.sav
   +- resume.sav
   +- shutdown.mp3
   +- startup.mp3
   +- system.ank
   +- system.fon
   +- system.l2u
   +- custom\
   |  +- ...        19 files not shown 
   |  +- lang0\
   |  |  +- ...     52 files not shown 
   |  +- lang1\
   |     +- ...     11 files not shown 
   +- plugin\
   |  +- ...        51 files not shown 
   +- skin\
      +- ...        18 files not shown 

Since I'm running on Win98 (but, hey, 2nd edition, so with USB), the microSD USB adapter is not supported (that is, the USB mass storage class driver is not part of Win98 SE; however I found a site that has Win98 drivers).

I made my own skin; that is to say, I adapted the skin of 'donruper'. If you like it, it's available for download! All the skin files should go in the /_system_ directory.

The top and lower screen when using my skin.

Moonshell is playing fine, but when it boots it shows on the lower screen

An error was detected while trying to access the disc or file.
Please confirm the followings:

Did the setup end normally?
Is the "/moonshl" folder moved or deleted?
Have you enabled the resume function with a media that is not supported?
Do you have enough free space on your media?
Please re-format the media and try again.
Please try with different media from another manufacturer.

On the other hand, I found on the web "In the current version 1.41 (at least on the SuperCard) it is normal to see an error screen, but don’t worry it goes away."

1.5. Architecture of an NDS with M3DS simply

How should we look at an NDS with an M3DS simply card?

My mental model is that of an old-fashioned MSDOS PC (see diagram below). The PC (NDS) has a BIOS. The BIOS contains some low level code. One of the tasks of the code is to manage some low-level hardware like the internal clock and the WiFi. The associated data (what hour/minute, respectively what SSID/WEP-key) is stored in the CMOS. Another relevant task of the PC BIOS is to check if the installed harddisk has a valid MBR (master boot record), and if so, to run that. This is the same for the NDS (and even more rigid): the NDS console checks whether the inserted game card is a valid one (there is some cryptographic challenge-response system employed by Nintendo), before the BIOS starts running it. In the case of the M3DS simply card, the code in the "MBR" runs _ds_menu.dat, much like the boot record of an MSDOS disk runs command.com. And although command.com is a console/textual application and _ds_menu.dat a graphical one, both serve the same purpose: let the end-user browse for an executable in the filesystem (on the microSD card), then load it and run it. Since the main purpose of _ds_menu.dat is to select, load and run an executable from the microSD, loader is not such a bad name either.

The NDS with an M3DS simply is much like an MSDOS computer.
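
The loader role described above (browse the file system, pick an executable, run it) starts with deciding which files count as executables. Here is a minimal sketch in C, assuming the loader simply filters on the .nds extension. The function name is_nds_file is hypothetical and is not taken from the M3 firmware.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 when `name` ends in ".nds"
   (compared case-insensitively) and has a non-empty base name,
   0 otherwise. */
int is_nds_file(const char *name)
{
    static const char ext[] = ".nds";
    size_t len = strlen(name);
    size_t extlen = sizeof ext - 1;
    size_t i;

    if (len <= extlen)
        return 0;   /* too short for a base name plus extension */
    for (i = 0; i < extlen; i++) {
        unsigned char c = (unsigned char)name[len - extlen + i];
        if (tolower(c) != ext[i])
            return 0;
    }
    return 1;
}
```

With this sketch, is_nds_file("demo1.nds") and is_nds_file("DEMO1.NDS") both return 1, while is_nds_file("_DS_MENU.DAT") returns 0.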

1.6. Running software from the M3DS

So, we have an NDS, with an M3DS simply, and a microSD card, with firmware. What other software can we put on? There are two answers: there is illegal software and there is homebrew.

By the way, it is possible to brick the NDS with homebrew: see for more details on a Trojan and unbricking. It is also possible to protect against that (I didn't do that – yet) with a Nintendo DS firmware replacement FlashMe: the first part of firmware flash is write-protected with the SL1 contact, in case malware erases the rest, you can still reinstall FlashMe.

Illegal software: There is an amazing number of pirate .nds files available. I've no clue how they are ripped [note dec 2007: the nds example eeprom suggests a possible solution]. I don't understand how the rippers obtain the games. I don't understand how they make money putting the illegal stuff on the web (they have to pay for the download bandwidth). I don't understand how they agree on a single numbering scheme for the ripped roms [note dec 2007: maybe the numbers are stored in the games, and handed out by Nintendo]. But it's there (checked nov 2007) (rechecked may 2010):

On these sites each rom has a 4 digit release number and usually a regional classification: either U (USA), E (Europe, that is English, Italian, French, Spanish, German, Netherlands), J (Japan), K (Korea), H (Holland, Netherlands), F (France), G (Germany), I (Italy) or S (Spain).

Legal software: There is quite an amount of homebrew software (see e.g. wikipedia). This is stuff written by fellow programmers, usually published for free:

We have the hardware, we installed the firmware, we tried homebrew. Now for the real thing: write homebrew!

2. Starting with homebrew

2.1. Finding homebrew on the web

So, I want to write my own homebrew. Not sure what yet. But let's start anyhow. But where?

I soon found a couple of "tutorials":

There are also some sites with technical background:

So, what do these have in common?

The devkitPro logo.

They all suggest devkitPro. It is a collection of toolchains for homebrew developers. devkitARM is the ARM toolchain of devkitPro (there is also e.g. a PowerPC toolchain). devkitARM contains gcc, the GNU compiler collection, and related tools. Furthermore, devkitPro makes libraries and header files available for Game Boy Advance, GP32, Playstation Portable, GameCube, and the Nintendo DS. For the NDS, we need from devkitPro: devkitARM, libnds and its headers (and some software to make all these Unix tools work on Windows). Libnds started out as a collection of defines for common memory locations in the DS. Today, libnds is a very useful library that is used by most of the Nintendo DS homebrew community.

Where to find the official resources?

2.2. Getting devkitPro

Let's do some real work after all this studying. Let's download devkitPro and install it, compile a sample application, upload it to the NDS and run it.

From the devkitPro homepage, select the download section, then Windows installer, which brings us to a sourceforge site. Download the 'updater' (I had version devkitProUpdater-1.4.4.exe); this is a rather small (200kB) application that downloads and installs devkitPro. Run it.

I did not install 'Programmer's Notepad' (I have another editor :-). I also skipped devkitPSP and devkitPPC. I probably should have skipped GBA and Mirko. I don't like installing software directly in the root (c:\devkitPro), so I followed my proven approach of installing in c:\Programs. My brother in law tried c:\Program Files\devkitPro and that failed! Indeed, later we found a line telling that spaces should not occur in the directory path (why doesn't the installer check for that? Every Windows user installs everything in Program Files)!
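
The check the installer could have done is tiny. Here is a minimal sketch in C of such a test; the function name path_has_space is hypothetical and this is not code from the devkitPro updater.

```c
#include <string.h>

/* Hypothetical helper: returns 1 when the install path contains a
   space (which the devkitPro tools cannot cope with), 0 otherwise. */
int path_has_space(const char *path)
{
    return strchr(path, ' ') != NULL;
}
```

For c:\Program Files\devkitPro this returns 1, for c:\Programs\devkitPro it returns 0.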

The devkitPro installation tree.

The devkitARM directory contains the Arm compiler. The examples directory contains several examples: motion, filesystem, 2D and 3D graphics, user input, a watch, simple sound. Insight seems to be a debugger. To my surprise libnds was included.

The msys directory is the control center. It brings Unix tools (like bash and make) to Windows (and probably also part of the Unix API for gcc etc). After installation you typically run Start|Programs|devkitPro|MSys. But in my case it failed with a 'Cannot find the rxvt.exe or sh.exe binary -- aborting.' message.

Running MSys failed...

I was not sure why running MSys failed, but the MSys command in the start menu was linked to a batch file (to C:\Programs\devkitPro\msys\msys.bat to be precise). And batch files are relatively easy to debug. I was suspecting it had to do with me running on Win98 (don't laugh), but to my surprise I found on line 23 the comment rem ember that we only execute here if we are in command.com. This suggests that Win9x is supported (WinNT+ has cmd.exe instead of command.com). So, MSys should run on my Win98...

On line 33 the working directory is set (set WD=.\bin\) and a quick inspection showed that the current directory at that point was C:\Programs\devkitPro\msys\bin, which doesn't have a bin child directory. I took the liberty of changing line 33 into set WD=..\bin\ and we were flying (note the double ..)! My brother in law found another way to fix this problem.

Just to be sure, I added the following lines to c:\autoexec.bat

rem - Added by Maarten for devkitPro
SET Path=%Path%;C:\Programs\devkitPro\msys\bin
SET Path=%Path%;C:\Programs\devkitPro\devkitARM\bin
SET Path=%Path%;C:\Programs\devkitPro\insight\bin
SET DEVKITPRO=/c/Programs/devkitPro
SET DEVKITARM=/c/Programs/devkitPro/devkitARM
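
Note that the two DEVKIT variables use the Unix-style spelling of the path (/c/Programs/...), whereas the Path entries use the Windows spelling. The mapping between the two can be sketched in C as follows. The function name to_msys_path is hypothetical, and the sketch only handles the simple "drive letter, colon, backslashes" case seen above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts "C:\Programs\devkitPro" into
   "/c/Programs/devkitPro". Returns 0 on success, -1 when the input
   does not start with a drive letter and a colon, or when `out` is
   too small. */
int to_msys_path(const char *win, char *out, size_t outsize)
{
    size_t len = strlen(win);
    size_t i;

    if (len < 2 || !isalpha((unsigned char)win[0]) || win[1] != ':')
        return -1;
    if (outsize < len + 1)   /* the result is as long as the input */
        return -1;

    out[0] = '/';
    out[1] = (char)tolower((unsigned char)win[0]);
    for (i = 2; i < len; i++)
        out[i] = (win[i] == '\\') ? '/' : win[i];
    out[len] = '\0';
    return 0;
}
```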

2.3. Hello world!

Next step: let's compile a sample application, upload it to the NDS and run it.

I downloaded the file from Chris Double's site. I still had to decide where to store my sources, but then I saw the home directory where MSys defaults to. So I decided to create a demo1 directory in maarten's home.

Storing my first project in its own directory in maarten's home

I added one line to demo1\source\arm9_main.cpp just to be sure it was my file being compiled!

// Demo1 ARM9 Code - Based on an example shipped with NDSLIB.
// Chris Double

#include <nds.h>
#include <stdio.h>

int main(void)
{
  // Use the touch screen for output
  videoSetModeSub(MODE_0_2D | DISPLAY_BG0_ACTIVE);

  // Set the colour of the font to White.
  BG_PALETTE_SUB[255] = RGB15(31,31,31);

  consoleInitDefault((u16*)SCREEN_BASE_BLOCK_SUB(31), (u16*)CHAR_BASE_BLOCK_SUB(0), 16);

  printf("\n\n\tHello World!\n");
  printf("Changed by Maarten Pennings\n");   // Added by Maarten
  while(1) {
    // Read the stylus position and print its pixel coordinates
    touchPosition touchXY = touchReadXY();
    printf("Touch x = %d   \n", touchXY.px);
    printf("Touch y = %d   \n", touchXY.py);
  }
  return 0;
}

I started MSys, changed to the demo1 directory and issued a make.

My first make. Successful. First time!

I used ftp on my PC (and DSFTP on my NDS) to transfer demo1.nds. I rebooted my NDS and ran demo1.nds. It worked!

Proof of success

By the way: one way to stop the MSys (bash) console is typing ^D.

To improve the edit-compile-debug cycle, I encourage everyone to download an emulator. I picked the one from nocash. It is small, but does a lot, no install hassle, it just runs!

Another tip to improve the edit-compile-debug cycle: use asserts. Yes: devkitPro made them work!
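
For readers who have not used them: an assert states a condition that must hold, and the program stops with the file name and line number when it does not. Here is a minimal sketch in standard C. The function dial_hour is a hypothetical example, and how the failure message is shown on the NDS is up to devkitPro.

```c
#include <assert.h>

/* Hypothetical helper: converts an hour in 24-hour notation (0..23)
   to the position on a 12-hour dial (0..11). */
int dial_hour(int hour24)
{
    /* Stops the program (reporting file and line) if the caller
       passes an hour outside 0..23. */
    assert(hour24 >= 0 && hour24 < 24);
    return hour24 % 12;
}
```

Calling dial_hour(13) returns 1; calling dial_hour(24) trips the assert instead of silently returning a wrong dial position.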

2.4. Makefiles

On PCs, you have a .c file, compile that to an .obj file and link that (with some libraries) to an .exe file. When the number of files grows, and the number of steps grows (e.g. converting audio or graphics sources to a linkable format), the build process needs to be automated. On PCs, the automation is usually hidden in so-called project files of IDEs (integrated development environments), but those project files usually map to what is still heavily used in command line environments: makefiles. In the case of devkitPro, makefiles are used to manage the build process. Even for smaller applications, the number of steps (and thus files) is bigger than one might expect at first sight. The reason is that the NDS has two processors (an Arm9 for the main task and an auxiliary Arm7), both of which need an executable.

For a small application, we only write one source file. Since it is for the Arm9, and since we typically write c++ programs, let's call our one source arm9_main.cpp. We use a version of gcc to compile it to arm9_main.o. The makefile fragment that manages this step consists of two lines.

arm9_main.o: arm9_main.cpp
  arm-eabi-g++ ... -mcpu=arm9tdmi ... -DARM9 -c arm9_main.cpp -oarm9_main.o

The first line lists the target (arm9_main.o), a colon, and the sources (only one: arm9_main.cpp). The second part (the second line, but there could be more than one) shows the command(s) that convert the mentioned source(s) to the mentioned target (by the way, these lines must start with a TAB). In this case there is a single command: a wrapper tool for the gcc cross compiler for Arm. I've removed several of the detailed options in the fragment, but kept some others to clarify aspects. For example, the option -mcpu=arm9tdmi tells gcc to generate code for an Arm9 core, and the option -DARM9 defines the symbol "ARM9" so that one can write #ifdef ARM9 in source files. The next two options tell gcc the source file (arm9_main.cpp) and the target file (arm9_main.o).

The object file (arm9_main.o) needs to be converted into an executable. Executables on Windows are known as exe; Unix executables usually have the ELF format. The arm-eabi toolchain also produces ELF files; the name eabi refers to the Arm embedded application binary interface they follow, and this makefile uses it as the file extension. The makefile fragment that creates the eabi from the object has the following form.
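
To make the effect of -DARM9 concrete, here is a minimal sketch in C of a source file that behaves differently per core. ARM9 and ARM7 are the symbols set by the -D options in the makefile fragments; the function name core_name and the "host" fallback (for compiling the file on a PC without either option) are my own assumptions.

```c
/* Returns the name of the core this file was compiled for.
   ARM9 / ARM7 come from the -DARM9 / -DARM7 compiler options;
   the "host" branch is an assumption for builds without either. */
const char *core_name(void)
{
#if defined(ARM9)
    return "arm9";
#elif defined(ARM7)
    return "arm7";
#else
    return "host";
#endif
}
```

The same source file can thus be shared between both sub-projects, with the preprocessor selecting the right branch at compile time.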

arm9.eabi: arm9_main.o
  arm-eabi-g++ ... -specs=ds_arm9.specs arm9_main.o ... -lnds9 -o arm9.eabi

Again, the first line lists the target (arm9.eabi), a colon, and the sources (only one: arm9_main.o). The second line shows the command that converts the mentioned source to the mentioned target. The same gcc wrapper tool is called (with different options). The source arm9_main.o is passed, but also a source that is not mentioned in the source list (because it is assumed to be there, it is not built by this makefile): the library for the nds on arm9: nds9. The last option identifies the target file arm9.eabi.

You might think that this is it, but it isn't. The problem is that an nds executable needs two eabi files, one for the Arm9 and one for the Arm7. The tool to compose such a dual-eabi is called ndstool. But ndstool is not given the eabi files just produced. Dev-scene explains that the loader does not handle the .eabi format very easily, so we need to strip away all extra info using objcopy. This leaves us with a nice flat binary for execution (the .eabi file contains debug information and other things). Hence the following makefile fragment that uses arm-eabi-objcopy to convert the .eabi to a .bin.

arm9.bin: arm9.eabi
  arm-eabi-objcopy -O binary arm9.eabi arm9.bin

We're nearly there. We do need the same three fragments to convert the arm7_main.cpp to arm7.bin. And then we have the fragment to instruct ndstool to combine the two "binary" eabi's to the MyApp.nds file.

MyApp.nds: arm7.bin arm9.bin
  ndstool -c MyApp.nds -9 arm9.bin -7 arm7.bin

The whole makefile should now be rather clear.

all: MyApp.nds

# The 3 steps to build the Arm7 code
arm7_main.o: arm7_main.cpp
  arm-eabi-g++ ... -mcpu=arm7tdmi ... -DARM7 -c arm7_main.cpp -oarm7_main.o

arm7.eabi: arm7_main.o
  arm-eabi-g++ ... -specs=ds_arm7.specs arm7_main.o ... -lnds7 -oarm7.eabi

arm7.bin: arm7.eabi
  arm-eabi-objcopy -O binary arm7.eabi arm7.bin

# The 3 steps to build the Arm9 code
arm9_main.o: arm9_main.cpp
  arm-eabi-g++ ... -mcpu=arm9tdmi ... -DARM9 -c arm9_main.cpp -oarm9_main.o

arm9.eabi: arm9_main.o
  arm-eabi-g++ ... -specs=ds_arm9.specs arm9_main.o ... -lnds9 -o arm9.eabi

arm9.bin: arm9.eabi
  arm-eabi-objcopy -O binary arm9.eabi arm9.bin

# The step to combine them into an .nds file
MyApp.nds: arm7.bin arm9.bin
  ndstool -c MyApp.nds -9 arm9.bin -7 arm7.bin

Notice the very first step all that requires MyApp.nds but without any actual working steps. The reason for having this first fragment is that make by default tries to make the first target (which would have been the senseless arm7_main.o).
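
Each fragment is only executed when its target is out of date. Here is a minimal sketch in C of that decision, simplified from what make really does (for instance, make first brings the prerequisites themselves up to date). The function name needs_rebuild and the timestamp convention are assumptions of this sketch.

```c
#include <stddef.h>

/* Timestamp convention for this sketch: 0 means "file does not
   exist", larger numbers mean "modified later".
   Returns 1 when the target must be (re)made, 0 when it is up to date. */
int needs_rebuild(long target_time, const long *prereq_times, size_t count)
{
    size_t i;

    if (target_time == 0)
        return 1;                 /* target is missing */
    for (i = 0; i < count; i++) {
        if (prereq_times[i] > target_time)
            return 1;             /* a prerequisite is newer */
    }
    return 0;                     /* nothing changed since last build */
}
```

For MyApp.nds the prerequisites are arm7.bin and arm9.bin: touching arm9_main.cpp makes arm9_main.o, then arm9.eabi, then arm9.bin newer than their targets, so the change ripples up to MyApp.nds.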

This makefile is very insightful, and I thank Chris Double for putting it on his website. Unfortunately the real makefiles used in devkitPro are different. They are different on at least two accounts. Firstly, they do not explicitly make the Arm7 code (it is available "in stock" in ndstool). Secondly, the devkitPro makefiles are generic. This is a feature of make which allows one to factor out the names of the actual .cpp/.o/.eabi/.bin files. This makes the makefiles reusable, but also far less understandable...

3. Writing my first application: the watch

3.1. Watch 1 - The watch that comes with devkitPro

We have all that is needed: devkitPro, a working build cycle, DSFTP. Let's pick our first project. A good source for information as well as inspiration are the examples that come with devkitPro. See devkitPro\examples\nds.

The 3D graphics examples are very impressive, but also a bit scary for a first application. The watch on the other hand is too dull.

There is one thing with the watch though: it isn't working! Yes, I took the original (download [note jan 2008: the source might work for you, depending on which Arm7 code is linked in by your environment]) example that comes with devkitPro. It builds. It runs. But all three hands of the clock point up/north/12. And the textual read-out says zero, zero, zero...

Of course, I first believed I'd done something wrong. An example, right out of the devkitPro box. Not working. But whatever I tried, no luck. At some moment, I realized that the clock chip is actually read by the Arm7 core and IPCed (inter processor communication) to the Arm9 for display. What would be the Arm7 code? I couldn't find it in devkitPro. The rumour goes that the upcoming "ndstool uses the default Arm7 core distributed with libnds in preference to an embedded one".

So, we need to start a new project, where we are in control of the Arm7 code. Fortunately, I remembered that Chris Double (next to the makefile we already studied) published an Arm7 file on his website.

3.2. Watch 2 - The watch 1/3 from devkitPro, 2/3 Chris Double

I downloaded the Arm7 code and makefile from Chris Double, and took the watch code from the Arm9 from devkitPro.

The problem with Chris' files is that they are old. The build was far from successful. These were the issues I fixed (see next changes).

Now, the code compiled. And the code ran. And the clock was moving. Albeit very strangely.

I didn't understand why the seconds were in the month position (etc, and the whole date was gone), but I had a running clock (download)!

At that moment, I found the sources of libnds. This confirmed the move of the BCD decode. And it showed why the time fields were at the wrong place and the date fields were missing: GetTime only reads 3 bytes, we need GetTimeAndDate to read all info.

void rtcGetTime(uint8 * time) {
  uint8 command, status;

  command = READ_TIME;
  rtcTransaction(&command, 1, time, 3);

  command = READ_STATUS_REG1;
  rtcTransaction(&command, 1, &status, 1);
  if ( status & STATUS_24HRS ) {
    time[0] &= 0x3f;
  } else {
  // ... (excerpt ends here; the rest of the function is not shown)

void rtcGetTimeAndDate(uint8 * time) {
  uint8 command, status;

  command = READ_TIME_AND_DATE;
  rtcTransaction(&command, 1, time, 7);

  command = READ_STATUS_REG1;
  rtcTransaction(&command, 1, &status, 1);

  if ( status & STATUS_24HRS ) {
    time[4] &= 0x3f;
  } else {
  // ... (excerpt ends here; the rest of the function is not shown)
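
The BCD decode mentioned above turns each byte from the clock chip into a plain number: in BCD (binary coded decimal) the high nibble holds the tens and the low nibble holds the units. Here is a minimal sketch in C. The names bcd_to_int and decode_hour are hypothetical (this is not the libnds code), and the 0x3f mask is simply copied from the listing above.

```c
/* Hypothetical helper: converts one packed BCD byte (e.g. 0x59)
   to its integer value (59). */
int bcd_to_int(unsigned char bcd)
{
    return (bcd >> 4) * 10 + (bcd & 0x0f);
}

/* Hypothetical helper: decodes an hour byte by first clearing the
   two upper bits with the same 0x3f mask as in the listing, and
   then converting the remaining BCD digits. */
int decode_hour(unsigned char raw)
{
    return bcd_to_int(raw & 0x3f);
}
```

So a seconds byte of 0x59 means 59 seconds, not 89.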

3.3. Watch 3 - The watch in modern devkitPro style

The first Watch program didn't work, the second one did, but was old-fashioned style. So now we go for strike three: doing it the official way. The most important part is getting the official makefile: I noticed that the directory devkitPro\examples\nds\templates\combined contains a c-file template for the Arm7 and the Arm9, it contains a makefile for the Arm7 and Arm9, and it contains a top-level makefile.

The template project, with makefiles and source files.

The basics of makefiles are simple. In this case the Arm7 makefile builds the Arm7 executable, the Arm9 makefile builds the Arm9 executable, and the top-level makefile calls the two sub makefiles, and then packs the two executables with an icon into an nds file. On the other hand, make in all details gets quite complex. See the gnu make manual for all the gory details. I will point out some highlights here.

The first aspect we notice is a check for the existence of the environment variable DEVKITARM.

ifeq ($(strip $(DEVKITARM)),)
$(error "Please set DEVKITARM in your environment. export DEVKITARM=<path to>devkitARM")
endif

Then comes an important part: the top-level make file includes a standard makefile in the DEVKITARM directory.

include $(DEVKITARM)/ds_rules

Let's take a small detour there. The ds_rules contains some standard rules, but also another include:

include $(DEVKITARM)/base_rules

Let's also have a quick look at this third file. It contains variables for most compile tools:

PREFIX    :=  arm-eabi-

export CC :=  $(PREFIX)gcc
export CXX  :=  $(PREFIX)g++
export AS :=  $(PREFIX)as
export AR :=  $(PREFIX)ar
export OBJCOPY  :=  $(PREFIX)objcopy

and also the basic rules using these tools, e.g.

%.o: %.c
  @echo $(notdir $<)
  $(CC) -MMD -MP -MF $(DEPSDIR)/$*.d $(CFLAGS) -c $< -o $@

The $(CC) thus expands to arm-eabi-gcc. Also recall that $< is the automatic variable that holds the name of the first prerequisite and $@ is the automatic variable that holds the name of the target.
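
The % in the rule %.o: %.c is the "stem": whatever it matches in the target name is substituted in the prerequisite name. Here is a minimal sketch in C of that substitution for this one rule. The function name source_for_object is hypothetical, and make's real pattern matching is far more general.

```c
#include <stddef.h>
#include <string.h>

/* For the pattern rule "%.o: %.c": given a target such as "arm9.o",
   writes the matching prerequisite "arm9.c" to `out`.
   Returns 0 on success, -1 when the target does not end in ".o",
   has an empty stem, or `out` is too small. */
int source_for_object(const char *target, char *out, size_t outsize)
{
    size_t len = strlen(target);

    if (len < 3 || strcmp(target + len - 2, ".o") != 0)
        return -1;
    if (outsize < len + 1)
        return -1;
    memcpy(out, target, len - 1);   /* copy the stem and the dot */
    out[len - 1] = 'c';             /* replace the 'o' by a 'c' */
    out[len] = '\0';
    return 0;
}
```

When make needs arm9.o, the rule thus runs with $@ being arm9.o and $< being arm9.c.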

But let's go back to the top-level makefile.

The next line cleverly uses the name of the current directory as the name for the target. Also a variable is set to record the top-level directory.

export TARGET   :=  $(shell basename $(CURDIR))
export TOPDIR   :=  $(CURDIR)

The next line contains a phony target. As the GNU make manual explains: a phony target is always remade, regardless of whether a file with that name exists. In this case, it forces the two sub-makefiles (for the Arm7 and the Arm9 sub-projects) to always be executed (see below).

.PHONY: $(TARGET).arm7 $(TARGET).arm9

Next, we find the top-level target xxxx.ds.gba that I never use, and its direct predecessor xxxx.nds.

all: $(TARGET).ds.gba

$(TARGET).ds.gba  : $(TARGET).nds

The next block is the only block I touched: I added the continuation character \ at the end of the ndstool line, and the line that follows it. This rule takes care of merging the Arm7 and Arm9 binaries into one nds package. The part I added puts a logo and a three line (semicolons are separators) comment in the file.

$(TARGET).nds : $(TARGET).arm7 $(TARGET).arm9
  ndstool -c $(TARGET).nds -7 $(TARGET).arm7 -9 $(TARGET).arm9 \
  -b logo.bmp "Watch3;Maarten Pennings;2007 November 16"

By the way, getting a working bmp file was not trivial: ndstool wants a bmp file with a palette! I used Microsoft Paint to create a 32x32 bitmap and saved that as a bmp file. By default this is a true-color bmp (24 bits per pixel and no palette), so ndstool doesn't like it. You have to save the bmp with a palette, e.g. as a '16 Color Bitmap'.
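
Whether a bmp file has a palette can be read from its header. Here is a minimal sketch in C, assuming the common layout written by Paint (a 14 byte file header followed by a BITMAPINFOHEADER), where the bits-per-pixel field is a little-endian 16-bit value at byte offset 28. The function names are hypothetical, and this is not what ndstool itself does.

```c
#include <stddef.h>

/* Returns the bits-per-pixel field of a bmp header, or -1 when the
   buffer is too short or does not start with the "BM" signature.
   Assumes the BITMAPINFOHEADER layout (field at byte offset 28). */
int bmp_bits_per_pixel(const unsigned char *data, size_t size)
{
    if (size < 30 || data[0] != 'B' || data[1] != 'M')
        return -1;
    return data[28] | (data[29] << 8);
}

/* Bitmaps with 8 or fewer bits per pixel store their colors in a
   palette; 24-bit true-color bitmaps do not. */
int bmp_has_palette(const unsigned char *data, size_t size)
{
    int bpp = bmp_bits_per_pixel(data, size);
    return bpp > 0 && bpp <= 8;
}
```

A '16 Color Bitmap' has 4 bits per pixel, so bmp_has_palette returns 1 for it; for the default 24-bit file it returns 0.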

The next targets are the phony targets from the beginning, ensuring a recursive call of make, for each of the two arms. Recall from the figure at the start of this paragraph that each of the two processors had its own directory, with its own sources and its own makefiles. It's these two makefiles that are called here.

$(TARGET).arm7  : arm7/$(TARGET).elf
$(TARGET).arm9  : arm9/$(TARGET).elf

arm7/$(TARGET).elf:
  $(MAKE) -C arm7

arm9/$(TARGET).elf:
  $(MAKE) -C arm9

The final part performs a clean (deletion of all generated files). This too has two recursive steps to the makefiles of the two processors.

clean:
  $(MAKE) -C arm9 clean
  $(MAKE) -C arm7 clean
  rm -f $(TARGET).ds.gba $(TARGET).nds $(TARGET).arm7 $(TARGET).arm9

The two sub makefiles (Arm7 and Arm9) are not identical. The good news is that the differences are either unnecessary, or understandable. The best news is that there is no need to change them. Even though I changed the names of the c files (from template.c to arm7.c and arm9.c respectively)!

So much for the makefiles. Which changes are needed in the c files?

The change in the Arm7 template is marginal: I only added a couple of lines to read the real-time-clock and put it in the IPC struct. The Arm9 template has been rewritten completely; instead I used a clean-ed up version of watch2. See the sources (and nds file) in for details. It now even features month and day-of-week as text, the hands move more analogue-like, it has white tick marks for the hours, and the hands are kite shaped (instead of rectangluar).

One last remark. The code sometimes features a printf("\x1b[7;2H"). Such a command is called an ANSI terminal control escape sequence. ANSI because that institute made the standard; escape because all the sequences start with the ASCII code 27 which is known as escape and which is written in C as \x1b; sequence because after the escape, a series of characters follows; and terminal control because these escape sequences control the terminal, i.e. the screen with the cursor. The sequence just mentioned moves the cursor to row 7 column 2. The syntax for the sequence is Cursor Home: \x1b[row;colH. See this site for more (although I'm not sure which escape sequences are supported by the libnds terminal).
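To make the syntax concrete, here is a minimal sketch in plain C that builds the Cursor Home sequence for a given row and column. The helper name cursor_home is made up for this example; it is not part of libnds, and I have only checked the string it produces, not how the libnds console reacts to it.

```c
#include <stdio.h>

/* Hypothetical helper (not part of libnds): writes the ANSI "Cursor Home"
   sequence ESC [ row ; col H into buf. Returns the number of characters
   written (excluding the terminating 0), or -1 if buf is too small. */
int cursor_home(char *buf, size_t size, int row, int col) {
  int n = snprintf(buf, size, "\x1b[%d;%dH", row, col);
  return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Printing the resulting buffer with printf("%s", buf) should have the same effect as the literal printf("\x1b[7;2H") when row is 7 and col is 2.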

4. Adding sound to the watch

The next task I assigned myself was adding sound to the clock. The standard way I found on the internet was to play a pre-recorded sound. The big picture is as follows. Somehow get a sound file (a wav file if you're on windows): download it, rip it, record it – whatever. Next, this file needs to be converted to a format the NDS understands. Thirdly, we need to "get it in" in our program. The trick we use to link-in resources (as they are called today) is an old fashioned one (for me, it dates back to the turbo pascal 3.0 days – 1986). We convert the sound fragment to an object file and use the linker to link it in, in our program. Finally, we need to play the sound fragment, by adding some code; typically some calls to libnds functions.

So, ignoring the first step, we have three grounds to cover: converting the sound, linking it into the program, and writing the code that plays it.

4.1. Converting sound so that the NDS can play it

The sound conversion tool of choice is SoX. SoX (Sound eXchange) is a command-line utility that can convert various sound formats into other formats. It can also apply various effects to these sound files during the conversion. It has many documented command-line features (the link just mentioned has nice color highlighting, but misses some info; this link suffers from the reverse).

My first tries of playing sound failed. Not totally; I did hear something. I could even recognize the original fragment. But it was horrible. The issues appeared to be simple. Most PC audio files contain a header that describes the format of the fragment. But the nds format is raw, there is no header. So I fed the fragment to the nds sound chip, but I happened to have configured the chip for a different format.

What is this format business? With format I don't mean mp3 versus aif here. No, a wav file contains non-encoded, non-compressed bare-metal samples. But still it has a header describing its format. The four key characteristics are: the sample rate, the sample size, the encoding, and the number of channels (mono or stereo).

So, when we have a sound fragment, we need to adapt it to the NDS's capabilities. The sample rate is pretty flexible, so this is just a matter of quality: the higher the rate, the better the sound and the more memory it eats. For the sample size and encoding, the choice is limited: PCM8, PCM16, IMA-ADPCM, PSG/Noise (see e.g. emubase). I only understand the former two, and from the SoX tool I learned that there is another choice: unsigned samples or signed (two's complement) samples. It seems that the NDS PCM samples are signed. Finally, the mono/stereo issue. The NDS has 16 channels, but I have no clue how to synchronize them (start them at exactly the same moment), and that would be necessary for stereo. So I stick to mono.

So much for the conversion theory. I googled for a tik/tak sound and a chime. With a wave editor, I've cut the tik/tak sound in a tik and a tak fragment. You can download my sound sources if you want to play along. Sox is meant to convert a sound fragment, so it requires an in-file (and optionally in-options), and it requires an out-file (and optionally out-options). It also has some general options. Of the general options -V is an important one: it makes Sox verbose, informing us about the input (and output) formats. Another nice trick is to use -e instead of an output file: this instructs SOX to generate no output, but still report on the input formats (maybe -e stands for "examine mode").

I've used the examine mode on all three sources. In the output, look for the key properties: sample rate, bits per sample, encoding and number of channels.

$ sox -V tik.wav -e
sox.exe: Detected file format type: wav

sox.exe: WAV Chunk fmt
sox.exe: WAV Chunk data
sox.exe: Reading Wave file: Microsoft PCM format, 1 channel, 22050 samp/sec
sox.exe:         44100 byte/sec, 2 block align, 16 bits/samp, 23254 data bytes
sox.exe:         11627 Samps/chans
sox.exe: Input file tik.wav: using sample rate 22050
         size shorts, encoding signed (2's complement), 1 channel

$ sox -V tak.wav -e
sox.exe: Detected file format type: wav

sox.exe: WAV Chunk fmt
sox.exe: WAV Chunk data
sox.exe: Reading Wave file: Microsoft PCM format, 1 channel, 22050 samp/sec
sox.exe:         44100 byte/sec, 2 block align, 16 bits/samp, 19366 data bytes
sox.exe:         9683 Samps/chans
sox.exe: Input file tak.wav: using sample rate 22050
         size shorts, encoding signed (2's complement), 1 channel

$ sox -V chime.wav -e
sox.exe: Detected file format type: wav

sox.exe: WAV Chunk fmt
sox.exe: WAV Chunk data
sox.exe: Reading Wave file: Microsoft PCM format, 1 channel, 11025 samp/sec
sox.exe:         11025 byte/sec, 1 block align, 8 bits/samp, 62182 data bytes
sox.exe:         62182 Samps/chans
sox.exe: Input file chime.wav: using sample rate 11025
         size bytes, encoding unsigned, 1 channel
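The byte rates in the sox reports above follow directly from the format: sample rate times bytes per sample times number of channels. A small sketch in C (the function names are mine, purely for illustration) reproduces those numbers and also gives a rough playing time for a fragment of a given size.

```c
#include <stdint.h>

/* Bytes per second of raw PCM: rate * bytes-per-sample * channels. */
uint32_t pcm_bytes_per_sec(uint32_t rate, uint32_t bits, uint32_t channels) {
  return rate * (bits / 8) * channels;
}

/* Playing time in whole milliseconds of a raw fragment of the given size. */
uint32_t pcm_millis(uint32_t bytes, uint32_t rate, uint32_t bits,
                    uint32_t channels) {
  uint64_t scaled = (uint64_t)bytes * 1000u;
  return (uint32_t)(scaled / pcm_bytes_per_sec(rate, bits, channels));
}
```

With the numbers sox reported, the 'tik' (23254 bytes at 22050 Hz, 16 bits, mono) lasts about half a second, and the 'chime' (62182 bytes at 11025 Hz, 8 bits, mono) a bit over five and a half seconds.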

Let's convert the wav's to a raw format. They all need to be signed since that's what the NDS hardware expects (the last wav is unsigned). I decided to keep the sample rates and to keep the sample size. There were no stereo fragments. So these are the commands I ran. Note that the w option selects Word (16 bits) and the b option selects Byte (8 bits). The s option selects Signed. Finally, the r option followed by a number indicates the sample Rate.

sox  tik.wav    -wsr 22050  tik.raw
sox  tak.wav    -wsr 22050  tak.raw
sox  chime.wav  -bsr 11025  chime.raw
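For the 8-bit chime, going from unsigned to signed is a very small operation. Unsigned 8-bit PCM has its silence level at 128, signed two's complement has it at 0, so flipping the top bit of every sample maps one onto the other. The sketch below shows that in C; I assume this is what the s option boils down to for this file, but I have not looked at the sox sources, and the function name is my own.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts 8-bit unsigned PCM (silence = 128) in place to 8-bit signed
   two's complement PCM (silence = 0) by flipping the top bit of each
   sample. Hypothetical helper, not taken from sox. */
void pcm8_unsigned_to_signed(uint8_t *samples, size_t count) {
  for (size_t i = 0; i < count; i++) {
    samples[i] ^= 0x80;
  }
}
```

So the unsigned values 0, 128 and 255 become the bytes 0x80, 0x00 and 0x7F, which read as -128, 0 and 127 when interpreted as signed.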

4.2. Adding sound fragments to the makefile

We're done with getting sound fragments in the right format. Now, we have to link them in. There is no need to make any changes in the top-level makefile (with respect to the 3rd try). For the two sub makefiles we have to make a choice. The Arm7 has control of the sound hardware. So it makes sense to link-in the sound fragments in the Arm7 executable. However, the Arm7 has 64kB of memory (see dev-scene), so it fills up quite quickly (our 3 files sum up to over 100kB!). Another trick is to link them in with the Arm9, and send the address to the Arm7. I do not understand yet why this works. I assume that the sound fragment resides in the 4MB memory, which is shared with the Arm7. Anyhow, I linked the fragments in with the Arm9, and it works!

So, when we link the fragments in with the Arm9 executable, there is no need to make changes in the Arm7 makefile (with respect to the 3rd try). Funnily enough, it is also not necessary to make any changes to the Arm9 makefile. Still I did, to separate the sound files from the rest: I made one change to an existing diversity parameter.

Near the top of the makefile, there is a comment that identifies this diversity parameter:

# DATA is a list of directories containing binary files

The raw sound files are binary files, so they somehow should be part of the DATA directories. I decided, inspired by the devkitPro examples, to create a single extra data directory next to the Arm9 source directory, to store all sound fragments.

The directory structure for watch4. Note the new data directory.

So I added the directory name (data) to the DATA variable:

DATA    := data

As amazing as it may seem, this is all that is needed (in the makefiles)!

The makefile has a variable (BINFILES) that automatically collects all files from the directories mentioned in variable DATA (which in our case is the 3 files in directory data). Note that any file extension in the data directory is picked up.

BINFILES  :=	$(foreach dir,$(DATA),$(notdir $(wildcard $(dir)/*.*)))

The makefile has a variable collecting all object files: not only compiled assembler (.s) files, compiled c files (.c), or compiled c++ files (.cpp), but also all binary files with a .o appended.

export OFILES	:=  $(addsuffix .o,$(BINFILES)) \
          $(CPPFILES:.cpp=.o) $(CFILES:.c=.o) $(SFILES:.s=.o)

And the Arm9 executable is formed by linking all .o files (and the libraries).

$(ARM9ELF)  :  $(OFILES)
  @echo linking $(notdir $@)
  @$(LD)  $(LDFLAGS) $(OFILES) $(LIBPATHS) $(LIBS) -o $@

The only issue is the conversion from the files in data to .o files. If that's done correctly, they will be linked in. At the end of the Arm9 makefile we find

# you need a rule like this for each extension you use as binary data 
%.bin.o : %.bin
  @echo $(notdir $<)
  @$(bin2o)

So, we either add "a rule like this for each extension you use as binary data", or make sure our data file have an extension for which we have a rule: .bin! I went for the latter, I renamed tik.raw to tik.bin, tak.raw to tak.bin and chime.raw to chime.bin.

I noticed that $(bin2o) is more than just a single command. It's a canned command sequence defined in the included base_rules (I've added new-lines in the listing below for clarity):

# canned command sequence for binary data 
define bin2o
  bin2s $< | $(AS) $(ARCH) -o $(@)
  echo "extern const u8"
       `(echo $(<F) | sed -e 's/^\([0-9]\)/_\1/' | tr . _)`"_end[];"
       > `(echo $(<F) | tr . _)`.h
  echo "extern const u8"
       `(echo $(<F) | sed -e 's/^\([0-9]\)/_\1/' | tr . _)`"[];"
       >> `(echo $(<F) | tr . _)`.h
  echo "extern const u32"
       `(echo $(<F) | sed -e 's/^\([0-9]\)/_\1/' | tr . _)`_size";"
       >> `(echo $(<F) | tr . _)`.h
endef

This code does two things. The first line creates the actual .o file. The next three lines generate a header file with the same name as the binary (with '.' replaced by '_') but .h appended. In our case it generates e.g. tik_bin.h for tik.bin (in arm9/build). The header file has three lines, defining the end address, the start address, and the size of the data. For example tik_bin.h has the following content.

extern const u8  tik_bin_end[];
extern const u8  tik_bin[];
extern const u32 tik_bin_size;
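The way the symbol name is derived from the file name (the sed and tr parts of the canned sequence) can be mimicked in a few lines of C. This is only a sketch to show the rule: a leading digit gets an underscore in front of it, and every '.' becomes a '_'. The function name bin2o_symbol is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Mimics the shell pipeline in bin2o: prefix '_' when the file name starts
   with a digit, then replace every '.' with '_'. Returns 0 on success,
   -1 if out is too small. Hypothetical helper, for illustration only. */
int bin2o_symbol(const char *file, char *out, size_t size) {
  size_t n = 0;
  if (size == 0) return -1;
  if (isdigit((unsigned char)file[0])) {
    if (n + 1 >= size) return -1;
    out[n++] = '_';
  }
  for (; *file != '\0'; file++) {
    if (n + 1 >= size) return -1;
    out[n++] = (*file == '.') ? '_' : *file;
  }
  out[n] = '\0';
  return 0;
}
```

For tik.bin this yields tik_bin, which is exactly the prefix of the three symbols in the generated header. A file called 8bit.raw would yield _8bit_raw, since a C identifier cannot start with a digit.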

4.3. Watch 4 - Writing the code to play sounds

The generated header files are important, because they give us symbols (like tik_bin) that tell us the address of the data (the sound fragment). But for this to work, we need to include them. So we add the following lines to the top of our Arm9 c file.

#include "tik_bin.h"
#include "tak_bin.h"
#include "chime_bin.h"

Next, we have to prepare a structure with (a pointer to) the data and the format parameters. Take special care with the sample rate and the format: they must match the raw file. The panning values are less relevant; they indicate 'balance'. I've chosen to let the 'tik' be mostly on the left, the 'tak' mostly on the right, and the 'chime' in the middle.

  TransferSoundData tik = {
    tik_bin,       /* address of raw sample */
    tik_bin_size , /* length of sample */
    22050,         /* sample rate */
    127,           /* volume: 0..127, muted to full */
    10,            /* panning (=balance): 0..127, left to right */
    0              /* format: 1=8bit, 0=16bit */
  };
  TransferSoundData tak = {
    tak_bin,       /* address of raw sample */
    tak_bin_size , /* length of sample */
    22050,         /* sample rate */
    127,           /* volume: 0..127, muted to full */
    117,           /* panning (=balance): 0..127, left to right */
    0              /* format: 1=8bit, 0=16bit */
  };
  TransferSoundData chime = {
    chime_bin,     /* address of raw sample */
    chime_bin_size,/* length of sample */
    11025,         /* sample rate */
    127,           /* volume: 0..127, muted to full */
    64,            /* panning (=balance): 0..127, left to right */
    1              /* format: 1=8bit, 0=16bit */
  };

The final thing is playing the sounds. I play the 'tik' on the even seconds and the 'tak' on the odd seconds, and the 'chime' on every full minute. By the way, notice that the 'tik' and 'tak' play through the 'chime' thanks to the 16 channels of the NDS (libnds searches for a free channel each time a playSound is issued). However, also notice that the very first 'tik' played alongside 'chime' is suppressed. That is a bug in our program: only one playSound can be IPCed to the Arm7 at a time... See below for a fix of that bug.

if( prevsec!=seconds ) playSound( seconds%2==0 ? &tik : &tak );

if( prevmin!=minutes ) playSound(&chime);

The full source (and binary) are available for download.

4.4. Sound theory (in four parts)

So, how come the playSound does its job? There are several parts to that. The complicating factor is that the sound data needs to travel from the Arm9 to the Arm7. The route is as follows.

Part 1: Setting up the data. The first part is describing the sound fragment on the Arm9. This is done with a variable of type TransferSoundData.

TransferSoundData tik = {
  tik_bin,       /* address of raw sample */
  tik_bin_size , /* length of sample */
  22050,         /* sample rate */
  127,           /* volume: 0..127, muted to full */
  10,            /* panning (=balance): 0..127, left to right */
  0              /* format: 1=8bit, 0=16bit */
};

Actually, TransferSoundData is struct sTransferSoundData as can be found in devkitPro\libnds\include\nds\ipc.h.

typedef struct sTransferSoundData {
  const void *data;
  u32 len;
  u32 rate;
  u8 vol;
  u8 pan;
  u8 format;
  u8 PADDING;
} TransferSoundData, * pTransferSoundData;

This struct holds the real data (data), as well as all meta data (len, rate, and format), some rendering data (vol, pan), and finally padding (PADDING) to make the struct a multiple of 4 bytes.

Note also that the ipc.h file contains a second sound related struct. It is called TransferSound (or struct sTransferSound).

typedef struct sTransferSound {
  TransferSoundData data[16];
  u8 count;
  u8 PADDING[3];
} TransferSound, * pTransferSound;

At first I was very confused by this struct, but later I realized its purpose. Where a TransferSoundData describes a single sound fragment, TransferSound describes (a maximum of) 16 fragments to be transferred from the Arm9 to the Arm7. The figure 16 is hardcoded, because the Arm7 sound chip has 16 channels. The count field is there to record how many sound fragments are actually transferred (in our case, this is 1).
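To see what the padding buys us, the two structs can be mirrored on a PC and their sizes checked. This is a sketch under one assumption: pointers on the NDS are 32 bits, so I model the data field as a uint32_t, otherwise a 64-bit PC would report different sizes. The type names with the Model suffix are mine.

```c
#include <stdint.h>

/* Host-side mirror of the libnds structs. On the NDS a pointer is 32 bits,
   so data is modelled as uint32_t here (an assumption made so the sizes
   also come out right on a 64-bit PC). */
typedef struct {
  uint32_t data;     /* address of raw sample (32-bit pointer on the NDS) */
  uint32_t len;
  uint32_t rate;
  uint8_t  vol;
  uint8_t  pan;
  uint8_t  format;
  uint8_t  PADDING;  /* rounds the struct up to a multiple of 4 bytes */
} TransferSoundDataModel;

typedef struct {
  TransferSoundDataModel data[16];
  uint8_t count;
  uint8_t PADDING[3];
} TransferSoundModel;
```

One fragment description comes out at 16 bytes, and the whole transfer block at 16 x 16 + 4 = 260 bytes, both multiples of 4.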

Part 2: From Arm9 to Arm7. As we saw in our program, a sound is played by calling e.g. playSound(&tik) on the Arm9. This function is declared in devkitPro\libnds\include\nds\arm9\sound.h.

But I want to see the implementation. For this we open up libnds-src-20071023\source\arm9\sound.c (you do need to download the libnds sources if you want to do this yourself).

static TransferSound Snd;

void playSound( pTransferSoundData sound ) {
  Snd.count = 1;
  memcpy( &Snd.data[0], sound, sizeof(TransferSoundData) );
  playSoundBlock( &Snd );
}

As we can see, the passed sound fragment sound is copied (memcpy) to Snd.data[0]. Variable Snd is actually a static buffer holding (a maximum of) 16 sound fragments (to transfer to the Arm7). We only copy one fragment; the count is therefore set to 1. The function falls through to playSoundBlock.

The function playSoundBlock is also implemented in sound.c. It does two things. Firstly, it forces a cache write-through with the function DC_FlushRange of snd (which, due to the call of playSound holds the global variable Snd – remember that C is case sensitive). Secondly, it puts the address of snd (that is, Snd) in the IPC struct IPC->soundData = snd.

static void playSoundBlock( TransferSound *snd ) {
  DC_FlushRange( snd, sizeof(TransferSound) );
  IPC->soundData = snd;
}

You might be wondering about the DC_FlushRange. It stands for 'data-cache flush memory range'. The sound data (Snd) is in shared memory, so Arm9 can write it and Arm7 can read it. However, when Arm9 writes it, it ends up in the data cache of the Arm9, not necessarily in the real memory. That only happens when the cache needs to be used for other data. And when Snd is in the cache but not yet in the real memory, the Arm7 reads garbage. That's why we need to force a cache flush.

The function DC_FlushRange is implemented in dcache.s (also part of the libnds sources). It is a loop, written in assembler, to flush a cache line at a time. Actually, the cache is not part of the Arm core, it is a peripheral device, but a close one, known as a coprocessor. The arm uses an mcr instruction (Move to Co-processor from arm Register) to give it instructions (sort of memory mapped I/O).

DC_FlushRange:                        [r0=start addr, r1=size] 
  add r1, r1, r0                      r1=r1+r0 [r0=start addr, r1=endaddr] 
  bic r0, r0, #(CACHE_LINE_SIZE - 1)  r0=r0 and not (SIZE-1) [r0 is rounded down] 
.flush:
  mcr p15, 0, r0, c7, c14, 1          coprocessor action: clean and flush address 
  add r0, r0, #CACHE_LINE_SIZE        r0=r0+SIZE [r0 is next cache line] 
  cmp r0, r1                          flags=r0-r1 [Z ~ r0==r1] 
  blt .flush                          if (r0) less than (r1) branch to .flush 
  bx  lr                              sets pc to lr (lr=r14, pc=r15) [return] 
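For those who read C more easily than Arm assembler, the loop can be modelled on a PC. The sketch below does not touch any cache, it only counts how many cache lines the loop would visit for a given start address and size. I assume a cache line of 32 bytes here; check CACHE_LINE_SIZE in the libnds sources to be sure. The function name is mine.

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 32u  /* assumed line size of the Arm9 data cache */

/* Returns how many cache lines the DC_FlushRange loop would clean for the
   given start address and size. Addresses are compared unsigned here; the
   assembler uses a signed compare, which gives the same result for
   addresses below 0x80000000. */
uint32_t flushed_lines(uint32_t start, uint32_t size) {
  uint32_t end   = start + size;                     /* add r1, r1, r0     */
  uint32_t addr  = start & ~(CACHE_LINE_SIZE - 1u);  /* bic: round down    */
  uint32_t lines = 0;
  do {                                               /* .flush:            */
    lines++;                                         /* mcr: clean a line  */
    addr += CACHE_LINE_SIZE;                         /* add r0, r0, #SIZE  */
  } while (addr < end);                              /* cmp + blt          */
  return lines;
}
```

Note that a range that starts in the middle of a line costs one extra line: 32 bytes starting at an aligned address take one iteration, 32 bytes starting 16 bytes further take two.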

By the way, the IPC struct itself is also in shared memory, but in a special region, namely uncached shared memory (see later on). So by writing the pointer (Snd) to IPC->soundData, the Arm7 can read the pointer. But when Arm7 dereferences the pointer, it ends up in the Snd structure which must be explicitly cache flushed. You should be wondering about one more "cache" issue. We now know that Snd.data[0] is flushed. However Snd.data[0].data points to the PCM data. Shouldn't that be flushed? No! This data is part of the code, it is converted to object format and linked-in by the linker. The loader puts it into memory. See the following diagram for an overview.

The IPC struct (in uncached memory) with a field pointing to the Snd buffer, with a pointer to the binary data (tik_bin).

So, with DC_FlushRange the entire TransferSound block (all data for 16 sound fragments) is flushed from cache to the real (shared) memory. Its pointer is written to uncached (shared) memory: a field of the IPC structure. The IPC structure (inter process communication) is picked up by the Arm7. The next section explains what the Arm7 does with it.

Part 3: Arm7 starts playing. In our Arm7 code (so not the standard devkitPro Arm7 code from December 2007), there is a VblankHandler. This is an ISR (interrupt service routine) which executes every vertical blanking period (60Hz).

The main function of the Arm7 sets up the 'vblank' interrupt. By the way, irqSet adds the (pointer to the) function VblankHandler to an array (irqTable). The assembler routine IntrMain (in interruptDispatcher.s) looks up the function pointer when an interrupt occurs and jumps to it.

int main(int argc, char ** argv) {
  IPC->soundData = 0; // see below why this is important 
  irqSet( IRQ_VBLANK, VblankHandler );
  irqEnable(IRQ_VBLANK ... );
  ...
}

The VblankHandler ISR (function) is written (well, stolen) by us; it's also part of the Arm7 source file.

void VblankHandler(void) {
  u32 i;
  TransferSound *snd = IPC->soundData;
  IPC->soundData = 0;
  if( 0!=snd ) {
    for( i=0; i<snd->count; i++ ) {
      s32 chan = getFreeSoundChannel();
      if( chan>=0 ) {
        startSound(snd->data[i].rate, snd->data[i].data, snd->data[i].len,
          chan, snd->data[i].vol, snd->data[i].pan, snd->data[i].format);
      }
    }
  }
}

The function VblankHandler is actually quite understandable. It first retrieves the pointer to the sound transfer block and saves that in the local variable snd (snd=IPC->soundData). Then the IPC sound transfer block pointer is cleared (IPC->soundData=0). This is important, because it ensures that every IPCed set of fragments is only played once. This is also the reason why Arm7's main function did set it to 0 (see above).

If the pointer was not zero, there is a freshly IPCed sound transfer block. The for loops over all passed sound fragments (snd->count); in our case there's always exactly 1 fragment. Then the hardware is queried to see which of the 16 channels is not playing any sound at the moment.

s32 getFreeSoundChannel() {
  int i;
  for( i=0; i<16; i++ ) {
    if( (SCHANNEL_CR(i) & SCHANNEL_ENABLE) == 0 ) return i;
  }
  return -1;
}

That seems impressive, but it is just a question of looping over all sound channel control registers (SCHANNEL_CR, there are 16 of them, one for each channel) and seeing whether they're still busy. As nocash documents:

40004x0h - NDS7 - SOUNDxCNT - Sound Channel X Control Register (R/W)
  Bit0-6    Volume Mul   (0..127=silent..loud)
  Bit7      Not used     (always zero)
  Bit8-9    Volume Div   (0=Normal, 1=Div2, 2=Div4, 3=Div16)
  Bit10-14  Not used     (always zero)
  Bit15     Hold         (0=Normal, 1=Hold last sample after one-shot sound)
  Bit16-22  Panning      (0..127=left..right) (64=half volume on both speakers)
  Bit23     Not used     (always zero)
  Bit24-26  Wave Duty    (0..7) ;HIGH=(N+1)*12.5%, LOW=(7-N)*12.5% (PSG only)
  Bit27-28  Repeat Mode  (0=Manual, 1=Loop Infinite, 2=One-Shot, 3=Prohibited)
  Bit29-30  Format       (0=PCM8, 1=PCM16, 2=IMA-ADPCM, 3=PSG/Noise)
  Bit31     Start/Status (0=Stop, 1=Start/Busy)

And indeed, in devkitPro\libnds\include\nds\arm7\audio.h we find, amongst others

#define SCHANNEL_ENABLE    BIT(31)
#define SCHANNEL_CR(n)     (*(vuint32*)(0x04000400 + ((n)<<4)))
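To get a feeling for the register layout, here is a sketch that composes a control word from the fields in the nocash table and computes the address of the control register of channel n. The macro and function names (CNT_VOL, make_cnt and so on) are made up for this sketch; libnds has its own macros in audio.h, which I have not copied here.

```c
#include <stdint.h>

/* Field layout taken from the SOUNDxCNT table above; all names below are
   hypothetical and only serve this example. */
#define CNT_VOL(v)     ((uint32_t)(v) & 0x7Fu)           /* bits 0-6   */
#define CNT_PAN(p)     (((uint32_t)(p) & 0x7Fu) << 16)   /* bits 16-22 */
#define CNT_REPEAT(r)  (((uint32_t)(r) & 0x3u) << 27)    /* bits 27-28 */
#define CNT_FORMAT(f)  (((uint32_t)(f) & 0x3u) << 29)    /* bits 29-30 */
#define CNT_START      (1u << 31)                        /* bit 31     */

/* Composes a control word that starts a channel. */
uint32_t make_cnt(unsigned vol, unsigned pan, unsigned repeat, unsigned fmt) {
  return CNT_START | CNT_FORMAT(fmt) | CNT_REPEAT(repeat)
       | CNT_PAN(pan) | CNT_VOL(vol);
}

/* Bit 31 doubles as the busy flag when the register is read back. */
int cnt_is_busy(uint32_t cnt) { return (cnt & CNT_START) != 0; }

/* Address of the control register of channel n (0..15): 16 bytes apart. */
uint32_t cnt_address(unsigned n) { return 0x04000400u + (n << 4); }
```

For example, full volume (127), centered (64), one-shot (2) and PCM16 (1) gives the word 0xB040007F, and channel 15 has its control register at 0x040004F0.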

But, let's go back to VblankHandler. When it has found a free channel, it uses that channel to play the sound fragment, by calling startSound (passing all the (real and meta) data of the sound fragment).

void startSound(int sampleRate, const void* data, u32 bytes,
                 u8 channel, u8 vol, u8 pan, u8 format) {
  SCHANNEL_TIMER(channel)  = SOUND_FREQ(sampleRate);
  SCHANNEL_SOURCE(channel) = (u32)data;
  SCHANNEL_LENGTH(channel) = bytes >> 2 ;
  SCHANNEL_CR(channel)     = SCHANNEL_ENABLE | SOUND_ONE_SHOT | SOUND_VOL(vol) |
                             SOUND_PAN(pan) | (format==1?SOUND_8BIT:SOUND_16BIT);
}

The startSound function just sets the hardware registers associated with the channel directly, thereby controlling the sound hardware (memory mapped I/O).

Part 4: What about IPC? I don't know about you, but I'm still wondering about the IPC structure. It is defined in devkitPro\libnds\include\nds\ipc.h as follows

static inline TransferRegion volatile * getIPC() {
  return (TransferRegion volatile *)(0x027FF000);
}

#define IPC     getIPC()

So, IPC is a macro that maps to an inline function getIPC that returns a pointer to 0x027FF000. Wondering what this address is about? At DSTek we find the memory maps of the Arm7 and Arm9.

ARM9 General Internal Memory
0000:0000-0000:7FFF	   ITCM (32KBytes)
0200:0000-023F:FFFF	   Main Memory (4MBytes)
037F:8000-037F:FFFF	   Shared IWRAM (siwram) (32KBytes Max.)
0400:0000-0400:????	   I/O RAM (??KBytes)
FFFF:0000-FFFF:7FFF	   BIOS (32KBytes)
----RELOCATABLE----	   DTCM (16KBytes)
----RELOCATABLE----	   Instruction Cache (8KBytes)
----RELOCATABLE----	   Data Cache (4KBytes)
--------N/A--------	   Write Buffer (32Bytes x 16 FIFO)

ARM7 General Internal Memory
0000:0000-0000:3FFF	   BIOS (16KBytes)
0200:0000-023F:FFFF	   Main Memory (4MBytes)
037F:8000-037F:FFFF	   Shared IWRAM (siwram) (32KBytes Max.)
0380:0000-0380:FFFF	   Exclusive IWRAM (eiwram) (64KBytes)
0400:0000-0400:????	   I/O RAM (??KBytes)

We also read "Main memory, consisting of one big block of 4MB memory, can be accessed by both CPU's. However, only one CPU can read/write/execute from it at a time. When both CPUs are trying to read main memory, one will have priority over the other." I still do not completely understand why there are no hiccups in the sound. The Arm9 is constantly reading memory (operand fetches), which should block the sound processor and cause hiccups. Or would the sound processor have a buffer? Somewhere I read something that hints at an explanation: Both the ARM7 and the ARM9 can access this [the main] memory at any time. Any bus conflicts are delegated to the processor which has priority (the ARM7 by default but changeable via a control register) causing the other processor to wait until the first has finished its operation.

Anyhow, the main memory is shared. And it happens to be at the same location 0200:0000 for both processors. In other words, if the Arm9 writes something at say 0200:0001, the Arm7 could read it at that same address. That is, after the cache is flushed.

There is another remark at the DSTek page: "Main Memory from 0200:0000-023F:FFFF is mirrored in 0240:0000-027F:FFFF, 0280:0000-02BF:FFFF, and 02C0:0000-02FF:FFFF." (emphasis by Maarten Pennings). So, the IPC struct is located in the first mirror: the IPC address 0x027F:F000 is near the end of the range 0240:0000-027F:FFFF. What the web page doesn't tell you is that this mirror is uncached!

In case you wonder where the address for IPC comes from, I think it is just "laziness". The devkitPro developers shaved 4k off the top of the 4M of shared memory, so that the C compiler would never use it (the C compiler doesn't know it exists, so it will never allocate variables in the last 4k). The 4k block is used to allocate the IPC struct manually.

Want to check for yourself? In devkitPro\devkitARM\arm-eabi\lib\ds_arm9.ld we find

ewram	: ORIGIN = 0x02000000, LENGTH = 4M - 4k

If we start computing, we get

  0200:0000   = ORIGIN main memory
 +  40:0000   = 4M (LENGTH)
 -     1000   = 4k (LENGTH correction)
  023F:F000   = Start 'manual area'
 +  40:0000   = Offset for mirror
  027F:F000   = Start 'manual area' mirror
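If you prefer to let the compiler do the hex arithmetic, the same computation can be written as a few lines of C. The constant and function names are mine; the values come straight from the linker script line and the mirror layout quoted above.

```c
#include <stdint.h>

#define MAIN_ORIGIN    0x02000000u           /* ORIGIN in ds_arm9.ld     */
#define MAIN_SIZE      (4u * 1024u * 1024u)  /* 4M                       */
#define RESERVED       (4u * 1024u)          /* 4k shaved off the top    */
#define MIRROR_OFFSET  MAIN_SIZE             /* first mirror is 4M up    */

/* Start of the 4k area the C compiler does not know about. */
uint32_t manual_area(void) { return MAIN_ORIGIN + MAIN_SIZE - RESERVED; }

/* The same area as seen through the first mirror of main memory. */
uint32_t manual_area_mirror(void) { return manual_area() + MIRROR_OFFSET; }
```

The second function returns 0x027FF000, the very address that getIPC hands out.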

In other words, main memory (for the C compiler) is from 0200:0000 up to but excluding 023F:F000. Its mirror is 0240:0000 up to but excluding 027F:F000. As a result, the IPC struct starts precisely after the compiler usable main memory, in the first uncached mirror! See diagram below for a visual rendering.

Part of the Arm9 memory map (main memory and its mirrors), zooming in on the main memory (middle) and on the first mirror (bottom).

Wow, we're done.

4.5. Multiple sounds at the same time and to fixed channels

Recall that we had a sound playing bug in watch4. At the whole minute two conditions hold: prevsec!=seconds as well as prevmin!=minutes. As a result, there are two calls to playSound at the whole minute: one for 'tik' and one for 'chime'. As we now know, playSound only assigns a pointer to a single global variable shared with the Arm7 (IPC->soundData).

In other words, the Arm9 sets IPC->soundData first with 'tik' and then with 'chime', leaving too little time in between to allow the Arm7 to read it. It is not a coincidence that Arm7 never sees the 'tik'. The reason for this is that both the processors are "slaving" on the vertical blanking interval. The Arm7 explicitly picks up the sound command in VblankHandler, which is the ISR registered for handling the vertical blanking interrupt. The Arm9 sending of the sound commands is implicitly synchronized with the vertical blanking interval.

int main(void) {
  ... init ...
  while(1) {
    ... drawing of clock hands ...
    if( prevsec!=seconds ) playSound( seconds%2==0 ? &tik : &tak );
    if( prevmin!=minutes ) playSound(&chime);
    glFlush();
  }
}

As the code demonstrates, main consists of an infinite while loop that has a glFlush call. And, as videoGL.h explains, it "Waits for a Vblank and swaps the buffers(like swiWaitForVBlank)".

Conclusion: there is no reliable way to use playSound to get two fragments to the Arm7. However, as we saw before, the mechanism (that is, the buffer TransferSound, and the handler at the Arm7 side) is available to transfer multiple (up to 16, the number of channels) fragments from the Arm9 to the Arm7. The problem is on the sending side: playSound only sends one fragment.
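The bug is easy to reproduce on a PC with a toy model of the mechanism: one pointer-sized "mailbox" that the sender overwrites and the receiver empties once per frame. This is a sketch of the idea only; the names post_sound and vblank_take are hypothetical and strings stand in for the sound fragments.

```c
#include <stddef.h>

/* Host-side model: one pointer-sized mailbox, like IPC->soundData. */
static const char *mailbox = NULL;

/* Arm9 side: posting simply overwrites whatever is in the mailbox. */
void post_sound(const char *name) { mailbox = name; }

/* Arm7 side, once per Vblank: take the command (if any), clear the slot. */
const char *vblank_take(void) {
  const char *cmd = mailbox;
  mailbox = NULL;
  return cmd;
}
```

Posting "tik" and then "chime" within the same frame leaves only "chime" for the receiver to find; the "tik" is gone without a trace, which is what we hear at the whole minute.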

I solved the problem by writing my own playSoundsEx that can send three fragments. Why 3? I only need 3 at the moment. The general solution would be to support 16, but having 16 arguments for playSound is a bit overdone. Supporting varargs for this tutorial is also a bit overdone.

The first version gives a rough idea, but two more revisions will follow!

TransferSound SndEx;

void playSoundsEx(pTransferSoundData s0,pTransferSoundData s1,pTransferSoundData s2) {
  SndEx.count = 3;
  memcpy( &SndEx.data[0], s0, sizeof(TransferSoundData) );
  memcpy( &SndEx.data[1], s1, sizeof(TransferSoundData) );
  memcpy( &SndEx.data[2], s2, sizeof(TransferSoundData) );
  DC_FlushRange( &SndEx, sizeof(TransferSound) );
  IPC->soundData = &SndEx;
}

There is one other improvement I wanted to make to the watch: I wanted it to tell the time. So, when it is 4:10 and the user touches the screen the NDS should say "It is ten past four". The big picture (the overall design) here is that we have several sound fragments, amongst others for "It is", "ten", "past", and "four". These fragments are stitched together to form a sentence.

The biggest hurdle to take is the stitching together of the sound fragments. What we need is to start playing "ten" when "It is" has just finished. Fortunately, the sound hardware tells us when it is done playing a sound fragment. As we saw before bit 31 of the sound control register for a channel tells whether the channel is stopped or busy.

This means that we can check whether a channel is done. Consequently, the Arm9 needs to know which channel to check. There is a problem: the current Arm7 code chooses by itself which channel will play an incoming request from the Arm9 (using getFreeSoundChannel).

I decided to replace the getFreeSoundChannel call from the VblankHandler in the Arm7 code by a fixed channel assignment. I map the request in slot data[i] of the TransferSound to channel i.

void VblankHandler(void) {
  u32 i;
  TransferSound *snd = IPC->soundData;
  IPC->soundData = 0; // Flag (to our own next Vblank invocation) that we did startSound 

  if( 0!=snd ) {
    if( snd->count>16 ) snd->count=16; // buffer overflow protection 
    for( i=0; i<snd->count; i++ ) {
      s32 chan = i; // new style: TransferSound at index i is for sound channel i 
      if( snd->data[i].len>0 ) {
        startSound(snd->data[i].rate, snd->data[i].data, snd->data[i].len, chan,
          snd->data[i].vol, snd->data[i].pan, snd->data[i].format);
      }
    }
  }
}

The new "if" (snd->data[i].len>0) is easily explained. Our new protocol maps fragment i to channel i. But what to do when we want to play something on channel 0 and channel 2, but not on 1? My solution: send a sound fragment for channel 1, but give it length 0. That's what this if covers.

This also needs to be fixed at the sending part (the NULL checks in playSoundsEx below). Our next revision of playSoundsEx has the following form (but the function needs one more fix).

void playSoundsEx(pTransferSoundData s0,pTransferSoundData s1,pTransferSoundData s2) {
  // Always sends three sounds, but when a sound is NULL, its len is set to 0
  SndEx.count = 3;
  if(s0==NULL) SndEx.data[0].len=0; else memcpy(&SndEx.data[0],s0,sizeof(TransferSoundData));
  if(s1==NULL) SndEx.data[1].len=0; else memcpy(&SndEx.data[1],s1,sizeof(TransferSoundData));
  if(s2==NULL) SndEx.data[2].len=0; else memcpy(&SndEx.data[2],s2,sizeof(TransferSoundData));
  DC_FlushRange( &SndEx, sizeof(TransferSound) );
  IPC->soundData = &SndEx;
}

We now have a playSoundsEx(s0,s1,s2) that can send up to three fragments from the Arm9 to the Arm7. Furthermore, s0 is mapped to channel 0, s1 is mapped to channel 1, and s2 is mapped to channel 2. And by setting s0, s1, and/or s2 to NULL, it's skipped. In watch5 I'm going to use the following channel assignment: channel 0 for 'tik' and 'tak', channel 1 for 'chime' and channel 2 for time telling.
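The slot-to-channel convention can be tried out on a PC without any NDS hardware. The sketch below is a simplified model, not the watch5 code: the struct is trimmed down, there is no cache flush or IPC, and started_channels only reports which channels the Arm7 handler would start (as a bit mask) instead of touching sound registers. All names are mine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { const void *data; uint32_t len; } SoundData;   /* trimmed */
typedef struct { SoundData data[16]; uint8_t count; } SoundBlock;

static SoundBlock block;  /* plays the role of SndEx */

/* Sending side: slot i is for channel i; a NULL sound gets length 0. */
void fill_block(const SoundData *s0, const SoundData *s1, const SoundData *s2) {
  const SoundData *in[3] = { s0, s1, s2 };
  block.count = 3;
  for (int i = 0; i < 3; i++) {
    if (in[i] == NULL) block.data[i].len = 0;
    else memcpy(&block.data[i], in[i], sizeof(SoundData));
  }
}

/* Receiving side: bit i of the result is set when channel i would start. */
uint16_t started_channels(const SoundBlock *b) {
  uint16_t mask = 0;
  uint8_t count = b->count > 16 ? 16 : b->count;  /* overflow protection */
  for (uint8_t i = 0; i < count; i++) {
    if (b->data[i].len > 0) mask |= (uint16_t)(1u << i);
  }
  return mask;
}
```

Filling the block with a sound for slots 0 and 2 and NULL for slot 1 gives the mask 0x5: channels 0 and 2 start, channel 1 is left alone.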

4.6. Synchronizing sound commands between Arm9 and Arm7

I thought I was done with the hard part. We can send a sound fragment to a fixed channel. All we have to do, is make the busy flags for the channels available to the Arm9. I added the following code to the VcountHandler of the Arm7:

uint16 sndBusy;
// Get the busy flag of all sound channels
sndBusy = 0;
for( i=0; i<16; i++ )
  if( SCHANNEL_CR(i) & SCHANNEL_ENABLE ) sndBusy |= BIT(i);
ARM7_BUSYBITS = sndBusy;

Recall that the macro BIT(i) expands to an integer with only bit i set. Also recall that x|=y means x= x|y (where | means bit-wise or). The variable ARM7_BUSYBITS is a field in the IPC struct. I decided to misuse the aux field for now.

#define ARM7_BUSYBITS    (IPC->aux)

The only thing we need to do after sending "It is" on channel 2, is wait till channel 2 is no longer busy (checking ARM7_BUSYBITS & BIT(2)) and then send "ten" (then "past", then "four").

This was a big miscalculation!

The problem is that 'sending' the command to play "It is" is an action by the Arm9. It takes some time (one frame at 60Hz?) before the Arm7 ISR receives the command and executes it. Then it again takes some time (another frame?) before the Arm7 "IPC interrupt" (VcountHandler) computes the ARM7_BUSYBITS. Only then does the Arm9 see that the hardware is busy playing the sound. But in the meantime it has already concluded that no sound was playing on the Arm7, so it sends "ten" before "It is" has even started!

So, we need to synchronize the Arm9 and Arm7 code.

One ingredient of the problem is that I wanted a single bit for things like "Arm9 has sent" (not a whole 4 byte word used as a single boolean). But since x|=BIT(2) is probably not a single machine instruction, we have to restrict ourselves to having a single writer for each bit vector.

This is the protocol I devised.

The Arm9 sends sound commands to the Arm7. It takes some time for the Arm7 to catch up with the Arm9 (asynchronous). Each processor records a bit (per channel) indicating the execution of the command. The Arm9 records them in ARM9_ALTBITS, and the Arm7 in ARM7_ALTBITS.

So, typically, first the Arm9 issues the command (signalling this in ARM9_ALTBITS), and after some time, the Arm7 picks it up (signalling this in ARM7_ALTBITS).

As noted above: each processor is the only writer to its own variable, but the ARM7_ALTBITS is also read by the Arm9. This means that the ARM7_ALTBITS needs to be part of an IPC mechanism. I decided to misuse the battery field of the IPC struct for that.

I first had the idea of setting the bit high when the command was executed. But this requires the Arm9 signalling back to the Arm7 that it has seen ARM7_ALTBITS being set (I say "signalling" because the Arm9 can not just clear a bit in ARM7_ALTBITS because that bitvector should have only one writer (which is Arm7) because it has multiple bits). So I decided to use an alternating bit protocol (hence the name ALTBITS).

Initially, the bit (for a channel) on the Arm9 is set the same as on the Arm7. When the Arm9 sends a command, it flips its local bit. When the Arm7 has received and executed the command it also flips its local bit. So, when the bits are equal, the Arm7 has caught up with the Arm9.

There is another way of looking at this. Suppose we had a counter for each channel at the Arm9 and the Arm7 side. The counter at the Arm9 side would indicate how many sound commands have been sent, the counter at the Arm7 side how many sound commands have been written to the sound chip. So, when the Arm9 counter equals the Arm7 counter, the Arm7 has caught up with the Arm9. So, how big should these counters be? We could make them one byte, and have them wrap around at 255 to 0 again. But since the Arm9 counter is never more than 1 ahead of the Arm7 counter, a one byte counter is overkill. A 4 bit counter (wrapping at 15 to 0) also suffices. But, wait, even a 1 bit counter with wrap around suffices. That's called an alternating bit.

Note however, that "executing" means instructing the HW to play the sound fragment. Since playing a sound fragment may take an arbitrarily long time (depending on the size of the PCM block), the Arm7 also tells whether it is still playing the fragment (ARM7_BUSYBITS).

Implementing this protocol on the Arm7 side means:
(1) declaring the flags

#define ARM7_ALTBITS     (IPC->battery         )
#define ARM7_ALTBIT(n)   (IPC->battery & BIT(n))
#define ARM7_BUSYBITS    (IPC->aux             )
#define ARM7_BUSYBIT(n)  (IPC->aux     & BIT(n))

(2) Initializing the flags

int main(int argc, char ** argv) {
  // Arm9 signalling
  ARM7_BUSYBITS = 0; // Arm7 is busy with none of the channels
  ARM7_ALTBITS  = 0; // Arbitrary 'alternating bit' settings -- must be the same on Arm9

(3) Controlling the flags

void startSound(int sampleRate,const void* data,u32 bytes,u8 channel,u8 vol,u8 pan,u8 format) {
  SCHANNEL_TIMER(channel)  = SOUND_FREQ(sampleRate);
  SCHANNEL_SOURCE(channel) = (u32)data;
  SCHANNEL_LENGTH(channel) = bytes >> 2;
  SCHANNEL_CR(channel)     = SCHANNEL_ENABLE | ...;
  // Arm9 signalling (BUSYBITS is using high/low, ALTBITS is alternating)
  ARM7_BUSYBITS |= BIT(channel); // Arm7 is busy on 'channel'
  ARM7_ALTBITS  ^= BIT(channel); // Signal that Arm7 processed 'Arm9 start sound command'
}

The order of the red assignments is non-trivial. The hardware is set to work (SCHANNEL_CR) before raising the BUSYBIT. If we reversed this, the routine would raise the BUSYBIT before the hardware is started, so if the VcountHandler ISR occurred in the middle, it would clear the BUSYBIT again. Secondly, the BUSYBIT is raised before the ALTBIT is toggled. If we reversed this, the Arm9 might see an executed 'start sound' (alternating bits on Arm7 and Arm9 are equal) with a low BUSYBIT, leading to the wrong conclusion that the hardware has finished playing the sound.

Implementing this protocol on the Arm9 side means:
(1) declaring the flags (recall that the Arm7 ALTBITS and BUSYBITS are read by the Arm9).

uint16 AltBits;
#define ARM9_ALTBITS     (AltBits              )
#define ARM9_ALTBIT(n)   (AltBits      & BIT(n))
#define ARM7_ALTBITS     (IPC->battery         )
#define ARM7_ALTBIT(n)   (IPC->battery & BIT(n))

#define ARM7_BUSYBITS    (IPC->aux             )
#define ARM7_BUSYBIT(n)  (IPC->aux     & BIT(n))

(2) Initializing the flags

int main(void) {
  // Set current status of alternating bits
  ARM9_ALTBITS= 0; // Arbitrary settings for all channels (all 0) -- must be the same on Arm7

(3) Controlling the flags (third and final version of playSoundsEx: red part added)

void playSoundsEx(pTransferSoundData s0,pTransferSoundData s1,pTransferSoundData s2) {
  // Always sends three sounds, but when a sound is NULL, its len is set to 0
  SndEx.count = 3;
  if(s0==NULL) SndEx.data[0].len=0; else memcpy(&SndEx.data[0],s0,sizeof(TransferSoundData));
  if(s1==NULL) SndEx.data[1].len=0; else memcpy(&SndEx.data[1],s1,sizeof(TransferSoundData));
  if(s2==NULL) SndEx.data[2].len=0; else memcpy(&SndEx.data[2],s2,sizeof(TransferSoundData));
  DC_FlushRange( &SndEx, sizeof(TransferSound) );
  // Record in ARM9_ALTBITS which commands we have sent out to the Arm7
  if( s0!=NULL ) ARM9_ALTBITS ^= BIT(0);
  if( s1!=NULL ) ARM9_ALTBITS ^= BIT(1);
  if( s2!=NULL ) ARM9_ALTBITS ^= BIT(2);
  IPC->soundData = &SndEx;
}

(4) Using the flags to actually decide when to send the next sound fragment (sending "ten" only when "It is" is done playing, that is when CHANNEL_FREE(2) holds).

#define CHANNEL_FREE(n)      (!CHANNEL_REQUEST(n) && !ARM7_BUSYBIT(n))
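The macro CHANNEL_REQUEST is not listed here. A minimal sketch of the whole handshake, assuming that a request counts as pending for as long as the two alternating bits of a channel differ. Plain variables stand in for the IPC fields, and all function names are mine, so this runs on a PC rather than on the NDS.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Plain variables standing in for the shared flags */
static uint16_t arm9_altbits  = 0;  /* written by the Arm9 only */
static uint16_t arm7_altbits  = 0;  /* written by the Arm7 only */
static uint16_t arm7_busybits = 0;  /* written by the Arm7 only */

/* Assumed definition: a request is pending while the alternating bits differ */
static int channel_request( int n ) { return ((arm9_altbits ^ arm7_altbits) & BIT(n)) != 0; }
static int channel_free( int n )    { return !channel_request(n) && !(arm7_busybits & BIT(n)); }

/* Arm9 side: send a command for channel n (only allowed on a free channel) */
static void arm9_send( int n ) {
  assert( channel_free(n) );
  arm9_altbits ^= BIT(n);
}

/* Arm7 side: command picked up; raise busy first, then toggle the alternating bit */
static void arm7_start( int n ) {
  arm7_busybits |= BIT(n);
  arm7_altbits  ^= BIT(n);
}

/* Arm7 side: the hardware finished playing on channel n */
static void arm7_finish( int n ) { arm7_busybits &= ~BIT(n); }
```

Between arm9_send(2) and arm7_finish(2) the function channel_free(2) is false at every step: first because the request is pending, later because the channel is busy. That closes the gap in which "ten" was sent too early.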

4.7. Watch 5 - Talking time

We're now done with multisound architecture, let's now implement talking time.

Part 1: Phrases. We need to decide which phrases the watch should say. I'm sorry to say that I decided to make a Dutch clock, but with some imagination you should be able to convert it to English. The Dutch clock phrases are very similar to the English phrases. The Dutch say the following (the hard parts, in red, are in the hh:16 .. hh:44 range).

time  Dutch                   English               Translation
4:00  Het is 4 uur            It is 4 o'clock
4:10  Het is 10 over 4        It is 10 past 4
4:15  Het is kwart over 4     It is quarter past 4
4:20  Het is 10 voor half 5   It is 20 past 4       "10 to half to 5"
4:30  Het is half 5           It is half past 4     "half to 5"
4:40  Het is 10 over half 5   It is 20 to 5         "10 past half to 5"
4:45  Het is kwart voor 5     It is quarter to 5
4:50  Het is 10 voor 5        It is 10 to 5

Graphically, the Dutch phrases could be partitioned on the minutes in eight groups as follows.

On the left (yellow on blue) the different sentences for telling time (8 categories, classified on minutes).
On the right (green) which sound fragments we need.

I let my 6 year-old son read all the "green" fragments and recorded them on my PC. Next, I used Audacity to normalize the volume, suppress noise, and cut them in pieces. Next, I used Sox to convert them to raw format.

Part 2: Fragment administration. We now have 20 fragments plus the three we had ('tik', 'tak', and 'chime'). Recall that for every fragment we have to get the symbols in (23 times a #include), and we have to initialize the sound structs (23 times filling a TransferSoundData with the right values for sample frequency or 8/16 bits format). On top of that, we have another issue. Fragments like 'tik' or 'over' will be referred to by name (as in playSound(&tik) to play "tik"). But the fragments representing numbers (1..14) are much more conveniently referred to by index (as in playSound(int_to_TransferSoundData_pointer(10)) to play "ten"). As a result, for 14 out of the 23 fragments we need such an indirection.

I've chosen to create an array aFragments of sound fragments (that is, its elements are of type TransferSoundData). Secondly, I will create an enum (eFragments) with symbolic names for the sound fragments. For example, when sound fragment "over" happens to be assigned to aFragments[17], the enum will have a tag FRAGMENT_over with value 17. This ensures we can say playSound(&aFragments[FRAGMENT_over]). Thirdly, I will assign the sound fragments to the array such that the sounds belonging to 1, 2, .., 14 end up in aFragments[1], aFragments[2], .., aFragments[14]. This ensures we can also say playSound(&aFragments[14]).

In conclusion, this means that three times (symbols, init, and enum) we need to do something for 23 sound fragments.
It's time to get organized.

I decided to create a single file (fragments.i) describing only the sound fragments, but all aspects of it: the "core" part of its file name, its rate, volume, pan and format. The file is straightforward:

FRAGMENT(hetis  , 11025 , 127,  64 , 1)
FRAGMENT(n01    , 11025 , 127,  64 , 1) // must be second, so it has offset 1! 
FRAGMENT(n02    , 11025 , 127,  64 , 1) // must be third, so it has...
FRAGMENT(n03    , 11025 , 127,  64 , 1)
FRAGMENT(n04    , 11025 , 127,  64 , 1)
FRAGMENT(n05    , 11025 , 127,  64 , 1)
FRAGMENT(n06    , 11025 , 127,  64 , 1)
FRAGMENT(n07    , 11025 , 127,  64 , 1)
FRAGMENT(n08    , 11025 , 127,  64 , 1)
FRAGMENT(n09    , 11025 , 127,  64 , 1)
FRAGMENT(n10    , 11025 , 127,  64 , 1)
FRAGMENT(n11    , 11025 , 127,  64 , 1)
FRAGMENT(n12    , 11025 , 127,  64 , 1)
FRAGMENT(n13    , 11025 , 127,  64 , 1)
FRAGMENT(n14    , 11025 , 127,  64 , 1)
FRAGMENT(kwart  , 11025 , 127,  64 , 1)
FRAGMENT(half   , 11025 , 127,  64 , 1)
FRAGMENT(uur    , 11025 , 127,  64 , 1)
FRAGMENT(over   , 11025 , 127,  64 , 1)
FRAGMENT(voor   , 11025 , 127,  64 , 1)
FRAGMENT(tik    , 22050 ,  80,  10 , 0)
FRAGMENT(tak    , 22050 ,  80, 117 , 0)
FRAGMENT(chime  , 11025 ,  80,  64 , 1)

The trick is that this file is included three times (for its symbols, init, and enum), each time with a different definition of the macro FRAGMENT. I'll give one example here: the generation of the enum. For the other examples, see the sources.

#define FRAGMENT(id,rate,vol,pan,fmt) FRAGMENT_##id,
enum {
#include "fragments.i" // see fragments.c for explanation of mechanism 
  FRAGMENT_last
} eFragments;

Observe that the first line indeed defines FRAGMENT. In this case, the macro forms an enum tag by concatenating (the red ##) the characters FRAGMENT_ with the value of argument id. Note also that the expansion of the FRAGMENT macro ends with a comma. The next line starts the actual enum definition. With the given macro definition, the third line (the #include) generates 23 lines of the form "FRAGMENT_xxx," (comma separated enum tags). The fourth line contains a sentinel (FRAGMENT_last). This serves two purposes: nicer syntax (having an identifier after the last comma), although C doesn't require this, and having a tag for the size of the aFragments array declaration.

The example expands to

enum {
FRAGMENT_hetis,
FRAGMENT_n01,
FRAGMENT_n02,
...
FRAGMENT_chime,
  FRAGMENT_last
} eFragments;

If you want to check for yourself, tell the c-compiler to stop after preprocessing:

arm-eabi-gcc  -E  fragments.h
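To play with the mechanism without the NDS sources, here is a small sketch of the same trick. It is a variation, not my actual code: instead of including fragments.i several times, the list lives in one macro (FRAGMENT_LIST) that takes the per-line macro as its argument, so everything fits in a single listing. The struct FragmentInfo and the function fragment_index are made up for the illustration, and the list is shortened to four entries.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for fragments.i: the list as a macro taking the per-line macro X */
#define FRAGMENT_LIST(X) \
  X(hetis, 11025, 127,  64, 1) \
  X(n01  , 11025, 127,  64, 1) \
  X(n02  , 11025, 127,  64, 1) \
  X(tik  , 22050,  80,  10, 0)

/* Expansion 1: the enum tags, closed by a sentinel */
#define AS_ENUM(id,rate,vol,pan,fmt) FRAGMENT_##id,
enum eFragments { FRAGMENT_LIST(AS_ENUM) FRAGMENT_last };

/* Expansion 2: a table with one record per fragment (hypothetical struct) */
typedef struct { const char *name; int rate, vol, pan, fmt; } FragmentInfo;
#define AS_INFO(id,rate,vol,pan,fmt) { #id, rate, vol, pan, fmt },
static const FragmentInfo aInfo[FRAGMENT_last] = { FRAGMENT_LIST(AS_INFO) };

/* Look up the index of a fragment by name; returns FRAGMENT_last when absent */
static int fragment_index( const char *name ) {
  int i;
  assert( name!=NULL );
  for( i=0; i<FRAGMENT_last; i++ ) {
    if( strcmp(aInfo[i].name, name)==0 ) return i;
  }
  return FRAGMENT_last;
}
```

Because both expansions come from the same list, the enum value FRAGMENT_tik and the record aInfo[FRAGMENT_tik] can never get out of step, which is the whole point of keeping a single description file.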

Part 3: The queue. When D-day happens for the "tik" or the "chime" we just issue a playSound. However, when D-day happens for talking time, we need to set aside a series of sound fragments. For this, I've written a first-in-first-out queue.

I will not explain in detail the implementation. The API of the queue is as follows

void               qInit ( void );                    // Create an empty queue 
int                qEmpty( void );                    // Check if queue is empty 
void               qAdd  ( pTransferSoundData item ); // Add a fragment to the queue 
pTransferSoundData qGet  ();                          // Retrieve a fragment from the queue 
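For readers who want something concrete behind this API, a possible implementation is a small ring buffer. This is a sketch under my own assumptions, not necessarily the code in the download: the capacity Q_SIZE is made up (the longest phrase has five fragments), and TransferSoundData is reduced to a dummy struct so that the fragment compiles without libnds.

```c
#include <assert.h>
#include <stddef.h>

/* Dummy stand-in so the sketch compiles off-target; on the NDS this type comes from libnds */
typedef struct { int len; } TransferSoundData, *pTransferSoundData;

#define Q_SIZE 8                      /* hypothetical capacity */
static pTransferSoundData qItems[Q_SIZE];
static int qHead;                     /* index of the oldest item */
static int qCount;                    /* number of items in the queue */

void qInit( void )  { qHead = 0; qCount = 0; }
int  qEmpty( void ) { return qCount==0; }

void qAdd( pTransferSoundData item ) {
  assert( qCount<Q_SIZE );            /* caller must not overfill the queue */
  qItems[(qHead+qCount)%Q_SIZE] = item;
  qCount++;
}

pTransferSoundData qGet( void ) {
  pTransferSoundData item;
  assert( qCount>0 );                 /* caller must check qEmpty() first */
  item = qItems[qHead];
  qHead = (qHead+1)%Q_SIZE;
  qCount--;
  return item;
}
```

Items come out in the order they went in, so a phrase added as "hetis", "n10", "over", "n04" is also played in that order.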

I've written one helper: it adds an entire time telling phrase to the queue, given hours and minutes. With the introduction of this chapter, the code should be clear even though it's Dutch. Note that I've defined two helper macros A and N as shorthands for adding a named fragment respectively a numeric fragment.

void qAddTime( int hours, int minutes ) {
#define A(id) qAdd( &aFragments[FRAGMENT_##id] )
#define N(i)  qAdd( &aFragments[i] )
  int next;
  hours= hours % 12; if( hours==0 ) hours= 12; // 1..12: index 0 is "hetis", not a number
  next= hours % 12 + 1;                        // the upcoming hour, also 1..12
  if( minutes==0 ) {
    A(hetis); N(hours); A(uur);
  } else if( minutes<15 ) {
    A(hetis); N(minutes); A(over); N(hours);
  } else if( minutes==15 ) {
    A(hetis); A(kwart); A(over); N(hours);
  } else if( minutes<30 ) {
    A(hetis); N(30-minutes); A(voor); A(half); N(next);
  } else if( minutes==30 ) {
    A(hetis); A(half); N(next);
  } else if( minutes<45 ) {
    A(hetis); N(minutes-30); A(over); A(half); N(next);
  } else if( minutes==45 ) {
    A(hetis); A(kwart); A(voor); N(next);
  } else {
    A(hetis); N(60-minutes); A(voor); N(next);
  }
#undef A
#undef N
}
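The eight categories are easier to check on a PC when the helper produces text instead of sound. The sketch below is such a text-only variant; the function name dutchTime is made up for this purpose. It maps the hours onto 1..12 first, so that twelve o'clock is spoken as "12" and the hour after twelve as "1".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the Dutch phrase for hours:minutes into buf (text instead of sound fragments) */
static void dutchTime( int hours, int minutes, char *buf, size_t size ) {
  int h, next;
  assert( hours>=0 && hours<24 && minutes>=0 && minutes<60 );
  h    = (hours+11)%12 + 1;   /* this hour on a 12 hour dial, 1..12 */
  next = h%12 + 1;            /* the upcoming hour, 1..12 */
  if(      minutes==0  ) snprintf(buf, size, "het is %d uur", h);
  else if( minutes<15  ) snprintf(buf, size, "het is %d over %d", minutes, h);
  else if( minutes==15 ) snprintf(buf, size, "het is kwart over %d", h);
  else if( minutes<30  ) snprintf(buf, size, "het is %d voor half %d", 30-minutes, next);
  else if( minutes==30 ) snprintf(buf, size, "het is half %d", next);
  else if( minutes<45  ) snprintf(buf, size, "het is %d over half %d", minutes-30, next);
  else if( minutes==45 ) snprintf(buf, size, "het is kwart voor %d", next);
  else                   snprintf(buf, size, "het is %d voor %d", 60-minutes, next);
}
```

Note that the minute numbers never exceed 14 in any branch, which is why fragments for 1..14 are enough.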

Part 4: The arm9 code. We now have all ingredients for a watch that can "talk time". See the crucial parts of the main function listed below.

int main(void) {
  while(1) {
    pTransferSoundData snd0, snd1, snd2;
    int hours=   IPC->time.rtc.hours;
    int minutes= IPC->time.rtc.minutes;
    int seconds= IPC->time.rtc.seconds;

    // Draw hands and tickmarks 

    // Screen freshly touched? Yes: add a new request for time telling. 
    if( !prevtouched && TOUCHED ) {
      // Choice: only one time telling request is queued at a time 
      if( qEmpty() ) qAddTime(hours, minutes);
    }

    snd0= snd1= snd2= NULL;
    // The while loops 60x per second. Are we at a full second? If yes play either tik or tak. 
    if( prevsec!=seconds ) snd0= seconds%2==0?&aFragments[FRAGMENT_tik]:&aFragments[FRAGMENT_tak];
    // Play a chime at "important" moments. 
    if( prevsec!=seconds && seconds%15==0 ) snd1=&aFragments[FRAGMENT_chime];
    // Do we have a pending request for time telling (queue not empty)? 
    if( !qEmpty() && CHANNEL_FREE(2) ) snd2=qGet();


The red part shows that a phrase is added when the screen is touched (if no phrase is being spoken), and that a fragment of the phrase is played when there still is a fragment (!qEmpty()) and when the "talk" channel (i.e. channel 2) is free (CHANNEL_FREE(2)). The blue parts emphasize the other sound fragments.

All sources (and the executable, and the original wav files) of watch5 are available for download. You will see that I've added several printfs to show what's going on (especially in the synchronisation area).

Part 5: Makefile goodies. I've made four minor, but nice, changes to the makefile. First of all, I've removed .ds.gba as top-level target, in favor of .nds. I never use the gba stuff, so this saves a little time and disk space. Secondly, I've removed the .nds (and also .ds.gba) suffix from the clean target. This means that make clean deletes all intermediate files, but not the final executable. I did add a clobber target which does a clean and deletes the top level .nds target. Thirdly, I've made the nds file description more clever. See the red parts in the command below.

ndstool	-c $(TARGET).nds -7 $(TARGET).arm7 -9 $(TARGET).arm9 \
  -b logo.bmp $(TARGET)";Maarten Pennings;`date +'%Y %b %d (%H:%M)'`"

Fourthly, I've added an extra command for the top-level target. It creates a small .log file showing the (build target and) build time. Naturally, this .log is removed with make clobber.

echo -e "Target=$(TARGET)\nDate=`date +'%Y %b %d (%H:%M)'`\n" > $(TARGET).log

I don't see any challenges in the sound area for now. Documenting it has taken more lines than I expected. Let's move to the next challenge/chapter.

5. Adding keys to the watch

I've been so busy with getting "talking time" that I didn't want to upgrade. The problem was that my brother in law did upgrade devkitPro, and that the old and new devkitPro are not compatible, so we couldn't exchange sources anymore. So, before delving into a new subject (adding key handling to the watch), I first had to reinstall devkitPro.

5.1. Reinstall devkitPro

I've been using an older version of devkitPro, while a newer version was available. This section discusses an upgrade of watch5 to watch6, merely adapting it to the new release.

In theory, upgrading is simple. I already have the devkitProUpdater: I downloaded devkitProUpdater-1.4.4.exe when installing devkitPro. So, it is just a matter of double clicking it, to initiate an update.

Well, yes and no. Something happened, which I still do not completely understand. I've restarted devkitProUpdater a couple of times. Sometimes from the devkitPro directory, sometimes from the desktop (with a freshly downloaded devkitProUpdater), sometimes from the desktop, but with the existing devkitPro install tree deleted by me (except of course, the directory 'maarten', which I rescued). The funny thing that happened is that sometimes I got old stuff: e.g. devkitARM_r20-win32.exe and msys-1.0.10.exe; but sometimes I got the new stuff: devkitARM_r21-win32.exe and msys-1.0.11-RC2.exe.

One of the things that seemed to make a difference was selecting 'keep' or 'remove' during the install wizard. It seemed like that to me, but it is very likely not true.

Selecting 'keep' seems to download the new stuff, selecting 'remove' seems to download the old stuff (no, this does not make sense).

Although it is very likely not true that this radio button changed what was downloaded, changing it did help me in some respect: by 'keep'ing the downloaded files, it is very clear which versions are installed. The following files (middle column) were kept after some runs of devkitProUpdater:

ini file   download dir          comment
           msys-1.0.10.exe       re-installed over 1.0.11
           msys-1.0.11-RC2.exe   does not work for win 98

After some retries I succeeded in installing all the new stuff (left most column shows the ini file created by the upgrader). There was one problem though: msys-1.0.11-RC2.exe does not work in Win98. My suspicion is that some of the text files are saved with Unix line endings (I know that msys.bat had them, but converting only that file appears not to be enough). I solved this by double clicking msys-1.0.10.exe after installing all the new stuff.

By the way, this setup still suffers from the fact that msys.bat doesn't run. My brother in law found a better fix than changing the bat files as we did before: he changed the shortcut!

The shortcut to msys, just after removing BIN from the working directory.

Be warned! The maintainers of devkitPro pay less attention to backward compatibility than I am used to. (also see previous changes I found).

5.2. Analysing the new devkitPro clock code

The new devkitPro has a rewrite of the clock code. What is this about?

The old Arm7 code used the VcountHandler ISR to read the time chip, and put that in the IPC struct. The following three lines of that 60 line ISR took care of that.

void VcountHandler() {
  uint8 ct[sizeof(IPC->time.curtime)];
  rtcGetTimeAndDate((uint8 *)ct);
  for( i=1; i<sizeof(ct); i++ ) {
    IPC->time.curtime[i]= ct[i-1];

The new Arm7 code uses a completely different mechanism: on startup, the software reads the time chip once, and an ISR that is called every second increments the time by one second. The main for Arm7 contains the call to initClockIRQ.

int main(int argc, char ** argv) {
  // Reset the clock if needed
  // Start the RTC tracking IRQ

The initClockIRQ can be found in libnds-src-20071023\source\arm7\clock.c. It performs two tasks: setting up an interrupt to increment the software clock (red), and initializing the software clock by reading the time chip (green). Note, most comments are by me (Maarten, based on the nocash site), they are not in the official source.

void initClockIRQ() {
  // Enables the NDS to receive clock interrupts 
  REG_RCNT = 0x8100;
  // Set up interrupt for software clock 
  irqSet(IRQ_NETWORK, syncRTC);
  // Program the clock chip for "Frequency steady interrupt" of "1Hz" 
  ... rtcTransaction(...); ...
  // Read all time settings on first start 
  rtcGetTimeAndDate((uint8 *)&(IPC->time.rtc.year));

  struct tm currentTime;
  currentTime.  ...  = IPC-> ... ;
  IPC->unixTime = mktime(&currentTime);
}

To my big surprise, time is now kept twice! Once as the well-known-to-us IPC->time (a struct with several independent fields). But a new addition to the IPC struct is the field vint32 unixTime (a single integer counting seconds). And as we can see from the blue part in initClockIRQ, that field is also set to the current time during startup (using a standard Unix(?) function mktime which converts the standard Unix(?) struct with independent fields struct tm to an integer).

The same c file (clock.c) also contains the ISR syncRTC that was installed by initClockIRQ. As we can see, this ISR just increments the seconds (++IPC->time.rtc.seconds) and upon overflow (i.e. when reaching 60) sets the seconds to zero and increments the minutes. We also see that when the day overflows, the actual clock chip is reread (my guess: incrementing days is too complex and the gain doesn't justify it). Also note that the unix time is incremented.

void syncRTC() {
  if( ++IPC->time.rtc.seconds==60 ) {
    IPC->time.rtc.seconds= 0;
    if( ++IPC->time.rtc.minutes==60 ) {
      IPC->time.rtc.minutes= 0;
      if( ++IPC->time.rtc.hours==24 ) {
        rtcGetTimeAndDate((uint8 *)&(IPC->time.rtc.year));
      }
    }
  }
  IPC->unixTime++;
}
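The carry chain of such a software clock can be tried on a PC. The sketch below mimics the ISR with a plain struct; SoftClock, rereadChip and tick are names I made up, and rereading the chip is faked by wrapping the hours to 0 and counting the call.

```c
#include <assert.h>

/* Hypothetical stand-in for the time fields of the IPC struct */
typedef struct { int hours, minutes, seconds; long unixTime; int rereads; } SoftClock;

/* Stand-in for rereading the clock chip: wraps the hours and counts the call */
static void rereadChip( SoftClock *c ) { c->hours = 0; c->rereads++; }

/* One tick of the 1 Hz interrupt */
static void tick( SoftClock *c ) {
  assert( c->seconds<60 && c->minutes<60 && c->hours<24 );
  if( ++c->seconds==60 ) {
    c->seconds = 0;
    if( ++c->minutes==60 ) {
      c->minutes = 0;
      if( ++c->hours==24 ) rereadChip(c);
    }
  }
  c->unixTime++;   /* the integer clock simply counts every tick */
}
```

Starting at 23:59:59, a single tick carries through all three fields and triggers exactly one reread; every other tick of the day costs only a few increments and compares.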

I haven't seen any reasoning for this change. But I think that reading the clock chip is relatively computation intensive (it's writing and reading messages over a serial link to an off-chip peripheral) and thus power hungry, compared to just incrementing an integer.

I do wonder about the interrupt used for incrementing the time though. We saw it being set in initClockIRQ.

void initClockIRQ() {
  // Set up interrupt for software clock 
  irqSet( IRQ_NETWORK, syncRTC );

Checking devkitPro\libnds\include\nds\interrupts.h we find the following not so helpful comments (network? serial interrupt? what does it have to do with seconds?).

enum IRQ_MASKS {
  IRQ_VBLANK      = BIT(0),   /* vertical blank interrupt mask */
  IRQ_HBLANK      = BIT(1),   /* horizontal blank interrupt mask */
  IRQ_VCOUNT      = BIT(2),   /* vcount match interrupt mask */
  IRQ_NETWORK     = BIT(7),   /* serial interrupt mask */
  IRQ_ALL         = (~0)

This interrupt is not the best known one. It is absent at neimods dstek site. But nocash knows about it.

4000210h - NDS9/NDS7 - IE - 32bit - Interrupt Enable (R/W)
4000214h - NDS9/NDS7 - IF - 32bit - Interrupt Request Flags (R/W)
Bit 7     NDS7 only: SIO/RCNT/RTC (Real Time Clock)

There is one thing that I still don't understand. The new watch example devkitPro\examples\nds\RealTimeClock\Watch no longer reads the IPC->time.rtc, instead, it uses the Unix time() function (red) to get the time (and gmtime (blue) to convert it to a Unix struct).

time_t unixTime = time(NULL);
struct tm* timeStruct = gmtime((const time_t *)&unixTime);

hours = timeStruct->tm_hour;
minutes = timeStruct->tm_min;
seconds = timeStruct->tm_sec;

I still don't understand why using time() from Unix' #include <time.h> is better than using IPC->time. What's more, it took me a really long time to figure out how the function time, implemented by devkitArm, which is supposed to be NDS independent, knows that it has to read IPC->unixTime.

But after several find-in-files, I figured it out!

One of the libnds source files (libnds-src-20071023\source\arm9\initSystem.c) implements the function initSystem(), which is presumably called during start-up. This function sets a pointer punixTime to point to IPC->unixTime.


extern time_t *punixTime;

void initSystem(void) {
  punixTime= (time_t*)&IPC->unixTime;

And it happens that devkitPro\devkitARM\arm-eabi\lib\libsysbase.a refers to this pointer!

5.3. The theory behind keys

Let's now add keys to our application.

Most keys can be directly read by the Arm9. The file devkitPro\libnds\include\nds\system.h defines the macro REG_KEYINPUT as mapping to address 0x04000130, which is the key status read register. Note that the bits are 0 when the button is pressed.

04000130 - REG_KEYINPUT - Key Status - read only
bit     9  8  7     6   5     4      3      2       1  0
button  L  R  down  up  left  right  start  select  B  A

In the above table, the hinge "button", the touch status, and the X and Y buttons are absent; they can not be accessed by the Arm9. The Arm7 accesses these four buttons via the macro REG_KEYXY mapping to 0x04000136. Note that the bits are 0 when the button is pressed respectively screen is touched, or lid is open.

04000136 - REG_KEYXY - Key X/Y Input - read only
bit     7    6      5  4  3  2  1  0
button  lid  touch              Y  X

Fortunately, the Arm7 puts those four bits into the IPC struct.

void VcountHandler() {
  uint16 but=0;
  but = REG_KEYXY;
  IPC->buttons    = but;

Libnds unifies all the keys. The header file input.h defines an enum KEYPAD_BITS.

input.h - enum KEYPAD_BITS
bit     13   12     11  10  9  8  7     6   5     4      3      2       1  0
button  lid  touch  Y   X   L  R  down  up  left  right  start  select  B  A

The unification is achieved, simply by or-ing the bits together (shifting appropriately). The source file libnds-src-20071023\source\arm9\keys.c shows how:

#define KEYS_CUR \
  ( ( ( (~REG_KEYINPUT)&0x3ff                      )   /* Take the 10 Arm9 bits (but flip) */ \
    | ( ((~IPC->buttons)&3)<<10                    )   /* Take the bit 0 and 1 of Arm7 (but flip) */ \
    | ( ((~IPC->buttons)<<6) & (KEY_TOUCH|KEY_LID) ) ) /* Take the bit 6 and 7 of Arm7 (but flip) */ \
    ^ KEY_LID )                                        /* Flip the lid bit (so 1 means closed now) */
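To convince myself of the bit juggling, the same merge can be written as a plain function that takes the two register values as arguments. This is a PC-side sketch: keysCur is a made-up name, and KEY_TOUCH and KEY_LID are defined here as bits 12 and 13, following the KEYPAD_BITS layout.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)    (1u << (n))
#define KEY_TOUCH BIT(12)
#define KEY_LID   BIT(13)

/* keyinput: value of REG_KEYINPUT, buttons: value of IPC->buttons (both active low).
   Returns the unified key word, where 1 means pressed, touched, or lid closed. */
static uint16_t keysCur( uint16_t keyinput, uint16_t buttons ) {
  uint32_t k;
  k  = (~(uint32_t)keyinput) & 0x3ff;                      /* 10 Arm9 keys, flipped */
  k |= ((~(uint32_t)buttons) & 3) << 10;                   /* X and Y to bits 10 and 11 */
  k |= ((~(uint32_t)buttons) << 6) & (KEY_TOUCH|KEY_LID);  /* touch and lid to bits 12 and 13 */
  k ^= KEY_LID;                                            /* lid: 1 now means closed */
  assert( k<=0x3fff );
  return (uint16_t)k;
}
```

With nothing pressed and the lid open (keyinput 0x3ff, buttons 0x43) the result is 0; pressing X while touching the screen (buttons 0x02) gives 0x1400, that is bit 10 plus bit 12.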

The code in keys.c is very straightforward (ignoring the repeat feature).

static uint16 keys = 0;
static uint16 keysold = 0;

void scanKeys(void) {
  keysold = keys;
  keys = KEYS_CUR;
}

uint32 keysHeld(void) {
  return keys;
}

uint32 keysDown(void) {
  return (keys ^ keysold) & keys;
}

uint32 keysUp(void) {
  return (keys ^ keysold) & (~keys);
}
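The xor-and-mask idiom is worth a small experiment. In the sketch below the current key word is passed in by hand instead of being read from the hardware; the names scanFrom, held, down and up are mine, chosen so they do not clash with libnds.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t keys = 0;     /* key word of the current scan */
static uint16_t keysold = 0;  /* key word of the previous scan */

static uint16_t held( void ) { return keys; }
static uint16_t down( void ) { return (uint16_t)((keys ^ keysold) & keys); }
static uint16_t up( void )   { return (uint16_t)((keys ^ keysold) & ~keys); }

/* Same bookkeeping as scanKeys, but the current key word is passed in */
static void scanFrom( uint16_t cur ) {
  keysold = keys;
  keys = cur;
  assert( (down() & up())==0 );  /* a key cannot be freshly pressed and freshly released at once */
}
```

The xor selects the bits that changed since the previous scan; masking with keys keeps the ones that are now set (fresh presses), masking with ~keys keeps the ones that are now clear (fresh releases).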

The design pattern is to have scanKeys in your main loop, and to check keysDown for fresh presses (keysHeld also gives the keys that were already pressed during the previous loop). This is clearly demonstrated in the code I copied from devkitPro\examples\nds\input\touch_look\source\main.cpp.

while( 1 )

  if( keysDown() & (KEY_LEFT|KEY_Y) )
  if( keysDown() & (KEY_RIGHT|KEY_A) )
  if( keysDown() & KEY_TOUCH )

  // Push our original Matrix onto the stack (save state) 
  // Here's where we do all the drawing 
  // Pop our Matrix from the stack (restore state) 
  // Flush to screen 

I've added keys to enable/disable "tik"/"tak", to select when the "chime" should play (never, every quarter of a minute, every minute, every quarter of an hour, or every hour). Then I decided to look into controlling power management. Later I will explain how to do this. But watch6 does allow controlling the backlights and the system power! And it still talks. And it has a more 3D appearance, which can be rotated manually (and automatically).

Possible watch improvements

5.4. The touch screen

The touch screen with pixel coordinates (in red) and the raw coordinates (in black). Note that the raw coordinates "drift" a little: the left-hand side has a raw x-coordinate of about 200, but it is 208 at the bottom and 192 at the top. Be aware that the raw coordinates are NDS specific (you need to calibrate your screen), you're now looking at my raw coordinates.

#include "nds.h"
#include <stdio.h>

int main(void) {
  touchPosition touchXY;

  videoSetMode(0);  //not using the main screen
  videoSetModeSub(MODE_0_2D | DISPLAY_BG0_ACTIVE);  //sub bg 0 will be used to print text

  BG_PALETTE_SUB[255] = RGB15(31,31,31);  //by default font will be rendered with color 255
  consoleInitDefault((u16*)SCREEN_BASE_BLOCK_SUB(31), (u16*)CHAR_BASE_BLOCK_SUB(0), 16);
  iprintf("Touch screen demo\n");

  while(1) {
    touchXY=touchReadXY();  // read the raw and pixel coordinates
    iprintf("Touch  x,px = %04d, %04d\n", touchXY.x, touchXY.px);
    iprintf("Touch  y,py = %04d, %04d\n", touchXY.y, touchXY.py);
    iprintf("Touch z1,z2 = %04d, %04d\n", touchXY.z1, touchXY.z2);
  }
  return 0;
}

6. Video architecture

This chapter tries to explain the NDS video architecture (which is sometimes also referred to as graphics architecture).

6.1. Video cores

The NDS has two LCD screens, referred to as the bottom screen and the top screen. Note that the bottom screen is the touch screen (it has a sensor for detecting touches with a pen).

Not surprisingly, the NDS has two graphics cores, referred to as the main graphics core and the sub graphics core. They are largely the same, but the main core has slightly more capabilities (more video modes, support for 3D). Note: the main core is also known as Engine A and the sub core as Engine B.

It is up to the programmer to choose whether the bottom screen is controlled by the main graphics core or the sub graphics core (the top screen being controlled by the other graphics core). The register REG_POWERCNT controls which core is attached to which screen. It also controls the power modes: the power for the LCD screens, the power for the main graphics core (even in three parts: its 2D part, its 3D part and its 3D matrix part), and the power for the sub graphics core. The following fragment from system.h illustrates this.

// Power control register.
// This register controls what hardware should be turned on or off.
#define	REG_POWERCNT	*(vu16*)0x4000304

#define POWER_LCD       BIT(0)  // Controls the power for both LCD screens.
#define POWER_2D_A      BIT(1)  // Controls the power for the main (or A) 2D core.
#define POWER_MATRIX    BIT(2)  // Controls the power for the 3D matrix.
#define POWER_3D_CORE   BIT(3)  // Controls the power for the main 3D core.
#define POWER_2D_B      BIT(9)  // Controls the power for the sub (or B) 2D core.
#define POWER_SWAP_LCDS BIT(15) // 0=Main on bottom, 1=main on top

// Enables power to all hardware required for 2D video.

// Enables power to all hardware required for 3D video.

// Switches the screens.
static inline void lcdSwap(void) { REG_POWERCNT ^= POWER_SWAP_LCDS; }

// Forces the main core to display on the top.
static inline void lcdMainOnTop(void) { REG_POWERCNT |= POWER_SWAP_LCDS; }

// Forces the main core to display on the bottom.
static inline void lcdMainOnBottom(void) { REG_POWERCNT &= ~POWER_SWAP_LCDS; }
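The register is nothing more than a set of independent bits, which is easy to try out with a plain variable. In the sketch below powercnt stands in for the memory mapped REG_POWERCNT, and powerOnBits, swapLcds and mainOnTop are made-up helpers (not the libnds functions); the bit values are the ones from the fragment above.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n)          (1u << (n))
#define POWER_LCD       BIT(0)
#define POWER_2D_A      BIT(1)
#define POWER_2D_B      BIT(9)
#define POWER_SWAP_LCDS BIT(15)

/* Plain variable standing in for the memory mapped REG_POWERCNT */
static uint16_t powercnt = 0;

/* Hypothetical helper: switch on the given power bits, leave the others alone */
static void powerOnBits( uint16_t bits ) {
  assert( (bits & POWER_SWAP_LCDS)==0 );  /* the swap bit is not a power bit */
  powercnt |= bits;
}

static void swapLcds( void )  { powercnt ^= POWER_SWAP_LCDS; }
static int  mainOnTop( void ) { return (powercnt & POWER_SWAP_LCDS)!=0; }
```

Powering the LCDs and both 2D cores gives 0x0203; swapping the screens only toggles bit 15 and leaves the power bits untouched.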

So, the following call to power the graphics cores is often found at the start of a game based on (2D) tiles:


6.2. Video "layers"

I would say that a core has a notion of layers, the front ones obscuring the back ones, except for those pixels that are transparent. But that is not the terminology that is in use. Fortunately, the concept is.

The NDS hardware offers five "layers". They are called: the foreground sprites, background 0, background 1, background 2, and background 3. For the four background "layers" (sorry that I keep on using that word), there are multiple flavors: text (a strange word for general purpose tiles), rot (tiles with a transformation matrix, so that they can be rotated and scaled; also known as affine tiles), and extrot ("extended" version of rotation, but can also work in bitmap mode). By the way, the relative order of the background layers can be programmed (so layer 1 can be in-front-of or behind layer 3) via the priority field.

Please note that common terminology dictates that the word "background" is used for any "layer" except the "foreground sprites layer".

6.3. Video modes

Both graphics cores have a notion of a video mode. A video mode determines which flavor (text, rot, extrot) is used for each of the "layers". So, not only does the hardware determine that there are exactly five layers, it also limits the programmer in choosing which flavor to use for each layer. On the positive side, each layer can be individually enabled so you're not stuck with five layers.

Video modes for the main and sub core (the "or 3D" option applies to the main core only)
mode  sprites  bg0         bg1   bg2     bg3
----  -------  ----------  ----  ------  ------
0     sprites  text or 3D  text  text    text
1     sprites  text or 3D  text  text    rot
2     sprites  text or 3D  text  rot     rot
3     sprites  text or 3D  text  text    extrot
4     sprites  text or 3D  text  rot     extrot
5     sprites  text or 3D  text  extrot  extrot

To select a mode, set the appropriate bits of register DISPLAY_CR (for the main graphics core) or SUB_DISPLAY_CR (for the sub graphics core). A quick look at dualis gives a lot of impressive details, but especially bits 0-2, 3, and 8-12 are important right now:

0x04000000 - DISPLAY_CR - Display control register
Bit   Explanation
---   -----------
0-2   BG mode (0-6=BG mode 0-6, 7=prohibited)
3     BG0 2D/3D (0=BG0 used for 2D, 1=BG0 used for 3D)
4     Character OBJ mapping mode (0=2D mapping, 1=1D mapping (see bit 20-21))
5-6   Bitmap OBJ mapping mode (0=128x128 bitmap, 1=256x64 bitmap, 2=1D bitmap (see bit 22), 3=prohibited)
7     Forced blank (1=Blank screen and allow access to VRAM)
8     Display BG0 (1=Display)
9     Display BG1
10    Display BG2
11    Display BG3
12    Display OBJ
13    Display window 0
14    Display window 1
15    Display OBJ window
16-17 Display mode (0=VRAM display (LCDC) mode, 1=BG mode, 2=prohibited, 3=Main RAM display mode)
18-19 VRAM selection (when LCDC mode is used)
20-21 Character OBJ extended mapping mode (when 1D char mode is used: 0=32kB capacity, 1=64kB, 2=128kB, 3=256KB)
22    Bitmap OBJ extended mapping mode (when 1D bitmap mode is used: 0=128kB bitmap, 1=256kB bitmap)
23    Allow OBJ VRAM access during h-blank
24-26 Master character offset (added to char base block. offset = n*64kB)
27-29 Master screen offset (added to screen base block. offset = n*64kB)
30    Extended BG palette master enable
31    Extended OBJ palette master enable
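
To make the bit positions a bit more tangible, here is a small sketch that pulls the three fields we care about out of a display control value. The function names are my own (hypothetical helpers, not part of libnds), and the code works on a plain integer so that it can be tried on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0-2: the BG mode. */
static unsigned dispcnt_bg_mode(uint32_t cr) {
    return (unsigned)(cr & 0x7u);
}

/* Bit 3: 1 when BG0 shows the output of the 3D core. */
static int dispcnt_bg0_is_3d(uint32_t cr) {
    return (int)((cr >> 3) & 1u);
}

/* Bits 8-12: one enable bit per layer; layer 0-3 is BG0-BG3, layer 4 is OBJ (sprites). */
static int dispcnt_layer_enabled(uint32_t cr, unsigned layer) {
    assert(layer <= 4);
    return (int)((cr >> (8 + layer)) & 1u);
}
```

For example, the value 0x10300 decodes to mode 0 with BG0 and BG1 enabled and BG2 disabled.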

The good news is that there are masks in the video.h header file. Here is an excerpt.

// Display control registers
#define DISPLAY_CR       (*(vuint32*)0x04000000)
#define SUB_DISPLAY_CR   (*(vuint32*)0x04001000)

// General modes
#define MODE_0_2D      0x10000
#define MODE_1_2D      0x10001
#define MODE_2_2D      0x10002
#define MODE_3_2D      0x10003
#define MODE_4_2D      0x10004
#define MODE_5_2D      0x10005
#define MODE_6_2D      0x10006 // main only

// Enabling individual "layers"
#define DISPLAY_BG0_ACTIVE    (1 << 8)
#define DISPLAY_BG1_ACTIVE    (1 << 9)
#define DISPLAY_BG2_ACTIVE    (1 << 10)
#define DISPLAY_BG3_ACTIVE    (1 << 11)
#define DISPLAY_SPR_ACTIVE    (1 << 12)

// 3D for BG0 (main only)
#define ENABLE_3D    (1<<3)

#define MODE_0_3D    (MODE_0_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)
#define MODE_1_3D    (MODE_1_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)
#define MODE_2_3D    (MODE_2_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)
#define MODE_3_3D    (MODE_3_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)
#define MODE_4_3D    (MODE_4_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)
#define MODE_5_3D    (MODE_5_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)
#define MODE_6_3D    (MODE_6_2D | DISPLAY_BG0_ACTIVE | ENABLE_3D)

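As a result, we can have two simple text backgrounds on the main core with the following setting:

DISPLAY_CR= MODE_0_2D | DISPLAY_BG0_ACTIVE | DISPLAY_BG1_ACTIVE;

Or, we can have a single text background on the main core and one on the sub core with the following setting:

DISPLAY_CR=     MODE_0_2D | DISPLAY_BG0_ACTIVE;
SUB_DISPLAY_CR= MODE_0_2D | DISPLAY_BG0_ACTIVE;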

Since the video.h header file also contains

static inline void videoSetMode   ( uint32 mode ) { DISPLAY_CR=      mode; }
static inline void videoSetModeSub( uint32 mode ) { SUB_DISPLAY_CR = mode; }

we see the second example quite often being programmed as follows

videoSetMode   ( MODE_0_2D | DISPLAY_BG0_ACTIVE );
videoSetModeSub( MODE_0_2D | DISPLAY_BG0_ACTIVE );

6.4. Video bases

With all the "layers" a lot of video data needs to be stored (palettes, tile-sets i.e. the bitmaps for each tile, tile-maps i.e. mapping each screen location to a tile, sprite bitmaps, etc). This is done in the video memory.

The table below (based on nocash's work) shows the system memory map (of the ARM9), including the video memory.

ARM9 system memory map
0000:0000h  Instruction TCM (32KB) (not moveable) (mirror-able to 1000000h)
0xxx:x000h  Data TCM        (16KB) (moveable)
0200:0000h  Main Memory     (4MB)
0300:0000h  Shared WRAM     (0KB, 16KB, or 32KB can be allocated to ARM9)
0400:0000h  ARM9-I/O Ports
0500:0000h  Standard Palettes (2KB) (main core BG/OBJ, sub core BG/OBJ)
0600:0000h  VRAM - main core, BG VRAM  (max 512KB)
0620:0000h  VRAM - sub  core, BG VRAM  (max 128KB)
0640:0000h  VRAM - main core, OBJ VRAM (max 256KB)
0660:0000h  VRAM - sub  core, OBJ VRAM (max 128KB)
0680:0000h  VRAM - "LCDC"-allocated (max 656KB)
0700:0000h  OAM (2KB) (main, sub core)
0800:0000h  GBA Slot ROM (max. 32MB)
0A00:0000h  GBA Slot RAM (max. 64KB)
FFFF:0000h  ARM9-BIOS (32KB) (only 3K used)

From the above map we learn that the main core uses the memory starting at 0600:0000 for backgrounds (and the memory starting at 0640:0000 for sprites). The sub core uses the memory at 0620:0000 for backgrounds. The palette is located at 0500:0000.

Let's get a bit more concrete and focus on a background of flavor 'text'. In this case we have tiles. Each tile is a grid of 8x8 pixels. And each pixel can have a variety of colors: either 16 (4 bit color depth) or 256 (8 bit color depth). So, in 256 color mode, each tile takes up 8x8x1 or 64 bytes. Each byte is an index into the palette.

There are two modes to index tiles. One uses 8 bits for the index so that 256 tiles can be selected. In this mode a tile-set takes 256x64 or 16k bytes. The other mode uses 10 bits for the index so that 1024 tiles can be selected. The tile-set then takes 1024x64 bytes or 64k bytes.
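
The arithmetic above can be written down as a small sketch. The names tile_bytes and tileset_bytes are hypothetical helpers of my own, only meant to check the numbers.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE_W = 8, TILE_H = 8 };

/* Bytes for one tile: 4 bits per pixel in 16 color mode, 8 bits in 256 color mode. */
static size_t tile_bytes(unsigned bits_per_pixel) {
    assert(bits_per_pixel == 4 || bits_per_pixel == 8);
    return (size_t)TILE_W * TILE_H * bits_per_pixel / 8;
}

/* Bytes for a tile-set whose tiles are selected with an index of index_bits bits. */
static size_t tileset_bytes(unsigned index_bits, unsigned bits_per_pixel) {
    assert(index_bits <= 10);
    return ((size_t)1 << index_bits) * tile_bytes(bits_per_pixel);
}
```

With 8 bits per pixel this gives 64 bytes per tile, 16384 bytes (16k) for an 8 bit index and 65536 bytes (64k) for a 10 bit index.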

A screen contains 32x24 tiles. However, the hardware has a larger so-called tile-map of 32x32 tiles (or even bigger like 64x64) and the screen "pans" over it. Each entry in the tile-map contains an index into the tile-set. There is a simpler (8 bit) index, but the 16 bit version is able to address the whole tile-set: its lower 10 bits hold the tile index, the remaining bits hold two flip flags and a palette number.
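
A sketch of the 16 bit tile-map entry, assuming the layout documented by nocash: bits 0-9 tile index, bit 10 horizontal flip, bit 11 vertical flip, bits 12-15 palette number (for 16 color tiles). The function names are hypothetical, chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one 16 bit tile-map entry. */
static uint16_t map_entry(unsigned tile, int hflip, int vflip, unsigned palette) {
    assert(tile < 1024 && palette < 16);
    return (uint16_t)(tile
                    | ((hflip ? 1u : 0u) << 10)
                    | ((vflip ? 1u : 0u) << 11)
                    | (palette << 12));
}

/* Extract the tile index (lower 10 bits) from an entry. */
static unsigned map_entry_tile(uint16_t entry) {
    return entry & 0x03FFu;
}

/* Bytes needed for a tile-map of width x height 16 bit entries. */
static size_t tilemap_bytes(unsigned width, unsigned height) {
    return (size_t)width * height * sizeof(uint16_t);
}
```

A plain entry for tile 5 is simply the value 5, and a 32x32 tile-map takes 2048 bytes, the 2k that comes back in the map base discussion.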


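To recap, we have the following concepts (and typical sizes):
tile      8x8 pixels; 64 bytes in 256 color mode
tile-set  256 tiles (16k bytes) or 1024 tiles (64k bytes)
tile-map  32x32 entries of 16 bits (2k bytes)
palette   256 colors of 16 bits (512 bytes)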

Next to setting the display control register DISPLAY_CR (selecting a video mode, and enabling backgrounds), we have to configure the individual backgrounds. The table below shows the control register for background 0 (from dualis).

0x04000008 - BG0_CR - BG0 control register
Bit   Explanation
---   -----------
0-1   Priority (0=highest...3=lowest)
2-5   Tile base block (individual character base offset. n*16kB)
6     Mosaic enable (1=On, 0=Off)
7     Color mode (0=16x16 palettes, 1=1x256 palettes)
8-12  Map base block (individual screen base offset. n*2kB)
13    Palette set 0/2 (extended palettes. 0=use set 0, 1=use set 2)
14-15 Screen size

Again, there are masks in the video.h header file. Here is an excerpt.

// Background control for main core
#define BG0_CR    (*(vuint16*)0x04000008)
#define BG1_CR    (*(vuint16*)0x0400000A)
#define BG2_CR    (*(vuint16*)0x0400000C)
#define BG3_CR    (*(vuint16*)0x0400000E)

// Background control for sub core
#define SUB_BG0_CR     (*(vuint16*)0x04001008)
#define SUB_BG1_CR     (*(vuint16*)0x0400100A)
#define SUB_BG2_CR     (*(vuint16*)0x0400100C)
#define SUB_BG3_CR     (*(vuint16*)0x0400100E)

// Color mode
#define BG_256_COLOR   (BIT(7))
#define BG_16_COLOR    (0)

// Priority
#define BG_PRIORITY(n) (n)
#define BG_PRIORITY_0  (0)
#define BG_PRIORITY_1  (1)
#define BG_PRIORITY_2  (2)
#define BG_PRIORITY_3  (3)

// Bases
#define BG_TILE_BASE(base) ((base) << 2)
#define BG_MAP_BASE(base)  ((base) << 8)

// Map size
#define BG_32x32       (0 << 14)
#define BG_64x32       (1 << 14)
#define BG_32x64       (2 << 14)
#define BG_64x64       (3 << 14)

// Palette select
#define BG_PALETTE_SLOT0 0
#define BG_PALETTE_SLOT1 0
#define BG_PALETTE_SLOT2 BIT(13)
#define BG_PALETTE_SLOT3 BIT(13)

The hard parts of background control are BG_TILE_BASE and BG_MAP_BASE. These two parameters determine where a graphics core looks for the tile-set and the tile-map of the background, respectively.

The BG_TILE_BASE is controlled with bits 2-5. Hence it can be a number 0-15. This number determines the memory base for the tile-set. Recall that a tile-set is 16k bytes decimal, which is 4000 hex. So, the tile bases are 0000, 4000, 8000, C000, 1:0000, 1:4000, etc. These are an offset to the start of the video memory (0600:0000 for the main core and 0620:0000 for the sub core).

Tile base  offset  abs address main  abs address sub
---------  ------  ----------------  ---------------
0          0000    0600:0000         0620:0000
1          4000    0600:4000         0620:4000
2          8000    0600:8000         0620:8000
3          C000    0600:C000         0620:C000
4          1:0000  0601:0000         0621:0000
5          1:4000  0601:4000         0621:4000
6          1:8000  0601:8000         0621:8000
7          1:C000  0601:C000         0621:C000
8          2:0000  0602:0000         0622:0000
9          2:4000  0602:4000         0622:4000
10         2:8000  0602:8000         0622:8000
11         2:C000  0602:C000         0622:C000
12         3:0000  0603:0000         0623:0000
13         3:4000  0603:4000         0623:4000
14         3:8000  0603:8000         0623:8000
15         3:C000  0603:C000         0623:C000

Similarly, the BG_MAP_BASE is set with bits 8-12. Hence it can be a number from 0-31. This number determines the memory base for the tile-map. Recall that a tile-map takes 2k byte or 800 hex, so the offsets of the map bases are 0000, 0800, 1000, 1800, 2000, 2800, etc. Again, these are an offset to the start of the video memory.

Map base  offset  abs address main  abs address sub
--------  ------  ----------------  ---------------
0         0000    0600:0000         0620:0000
1         0800    0600:0800         0620:0800
2         1000    0600:1000         0620:1000
3         1800    0600:1800         0620:1800
4         2000    0600:2000         0620:2000
5         2800    0600:2800         0620:2800
...       ...     ...               ...
30        F000    0600:F000         0620:F000
31        F800    0600:F800         0620:F800
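
Both tables follow from one multiplication each. A small sketch, with hypothetical function names and the two start addresses taken from the memory map above:

```c
#include <assert.h>
#include <stdint.h>

#define MAIN_BG_VRAM 0x06000000u  /* start of BG VRAM for the main core */
#define SUB_BG_VRAM  0x06200000u  /* start of BG VRAM for the sub core  */

/* Absolute address of a tile base: steps of 0x4000 (16k bytes). */
static uint32_t tile_base_address(uint32_t vram_start, unsigned base) {
    assert(base < 16);
    return vram_start + base * 0x4000u;
}

/* Absolute address of a map base: steps of 0x0800 (2k bytes). */
static uint32_t map_base_address(uint32_t vram_start, unsigned base) {
    assert(base < 32);
    return vram_start + base * 0x0800u;
}
```

Note that the two numberings cover the same memory: tile base 1 and map base 8 are both at offset 4000, so it is up to the programmer to keep tile-sets and tile-maps from overlapping.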

This means that a typical dual background set-up would do

BG0_CR= BG_32x32 | BG_COLOR_16 | BG_MAP_BASE(0) | BG_TILE_BASE(1);
BG1_CR= BG_32x32 | BG_COLOR_16 | BG_MAP_BASE(1) | BG_TILE_BASE(2);
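
To see which numbers end up in the registers, here is a sketch that packs the fields by hand, following the bit positions of the BG0_CR table. The function names are hypothetical. On the NDS you would of course use the masks from video.h.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the fields of a background control register. */
static uint16_t bg_control(unsigned priority, unsigned tile_base, int color256,
                           unsigned map_base, unsigned size) {
    assert(priority < 4 && tile_base < 16 && map_base < 32 && size < 4);
    return (uint16_t)(priority                      /* bits 0-1   */
                    | (tile_base << 2)              /* bits 2-5   */
                    | ((color256 ? 1u : 0u) << 7)   /* bit 7      */
                    | (map_base << 8)               /* bits 8-12  */
                    | (size << 14));                /* bits 14-15 */
}

/* Read the tile base (bits 2-5) back from a control value. */
static unsigned bg_control_tile_base(uint16_t cr) { return (cr >> 2) & 0xFu; }

/* Read the map base (bits 8-12) back from a control value. */
static unsigned bg_control_map_base(uint16_t cr)  { return (cr >> 8) & 0x1Fu; }
```

The two statements above then write 0x0004 to BG0_CR (tile base 1, map base 0) and 0x0108 to BG1_CR (tile base 2, map base 1).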

Leading to the following video memory map (the map base numbers are shown in hex).

       tilebase         mapbase    allocated
_______________ _______________ ____________
nr       offset nr       offset
    0x4000=16kB       0x800=2kB
__ ____________ __ ____________ ____________
 0    0600:0000  0    0600:0000  MAP for BG0
                __ ____________ ____________
                 1    0600:0800  MAP for BG1
                __ ____________ ____________
                 2    0600:1000
                __ ____________
                 3    0600:1800
                __ ____________
                 4    0600:2000
                __ ____________
                 5    0600:2800
                __ ____________
                 6    0600:3000
                __ ____________
                 7    0600:3800
__ ____________ __ ____________ ____________
 1    0600:4000  8    0600:4000 TILE for BG0
                __ ____________
                 9    0600:4800
                __ ____________
                 A    0600:5000
                __ ____________
                 B    0600:5800
                __ ____________
                 C    0600:6000
                __ ____________
                 D    0600:6800
                __ ____________
                 E    0600:7000
                __ ____________
                 F    0600:7800
__ ____________ __ ____________ ____________
 2    0600:8000 10    0600:8000 TILE for BG1
                __ ____________
                11    0600:8800
                __ ____________
                12    0600:9000
                __ ____________
                13    0600:9800
                __ ____________
                14    0600:A000
                __ ____________
                15    0600:A800
                __ ____________
                16    0600:B000
                __ ____________
                17    0600:B800
__ ____________ __ ____________ ____________
      0600:C000       0600:C000

Suppose that we want to set up an initial palette, tile-set and tile-map for a dual background example. Let's assume that we have three arrays bg0_tileset, bg0_tilemap, and bg0_palette for background 0 together with three integers holding their sizes (we have a similar set for background 1). We would have to write them to the correct locations: for background 0 the tile-set goes to tile base 1 (0600:4000), the tile-map to map base 0 (0600:0000), and the palette to 0500:0000.

Fortunately, video.h provides macros for the first two (note the sizes of the tile bases and map bases in the formula)

#define BG_TILE_RAM(base)     (((base)*0x4000) + 0x06000000)
#define BG_MAP_RAM(base)      (((base)*0x0800) + 0x06000000)

#define BG_TILE_RAM_SUB(base) (((base)*0x4000) + 0x06200000)
#define BG_MAP_RAM_SUB(base)  (((base)*0x0800) + 0x06200000)

and memory.h provides a macro for the last one

#define BG_PALETTE       ((uint16*)0x05000000)
#define BG_PALETTE_SUB   ((uint16*)0x05000400)

So, to set up background 0 and background 1 requires three memory copies each.

// Copying tileset, tilemap and palette for bg0
memcpy16( (void*)BG_TILE_RAM(1) , bg0_tileset, bg0_tileset_size );
memcpy16( (void*)BG_MAP_RAM(0)  , bg0_tilemap, bg0_tilemap_size );
memcpy16( BG_PALETTE     , bg0_palette, bg0_palette_size );

// Copying tileset, tilemap and palette for bg1
memcpy16( (void*)BG_TILE_RAM(2) , bg1_tileset, bg1_tileset_size );
memcpy16( (void*)BG_MAP_RAM(1)  , bg1_tilemap, bg1_tilemap_size );
memcpy16( &BG_PALETTE[16], bg1_palette, bg1_palette_size );

If you wonder about the funny memcpy16: it is the same as the standard memcpy except that it copies in 16 bit chunks. I wrote it myself (simple copy loop). Why? Because it seems that you can not write bytes to VRAM (only byte-pairs). Another option is to use swiCopy from ndslib.
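
For reference, a minimal sketch of such a copy loop. It is named memcpy16_sketch to make clear that it is an illustration and not necessarily identical to my real memcpy16. As an extra assumption it rejects odd byte counts with an assert instead of silently rounding down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy numbytes bytes using 16 bit writes only; numbytes must be even. */
static void memcpy16_sketch(volatile uint16_t *dst, const uint16_t *src, size_t numbytes) {
    assert(numbytes % 2 == 0);
    size_t halfwords = numbytes / 2;
    while (halfwords > 0) {
        *dst = *src;   /* one 16 bit store per iteration */
        dst++;
        src++;
        halfwords--;
    }
}
```

The destination is declared volatile so that the compiler keeps every store as a separate 16 bit write, which is the whole point when the target is VRAM.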

6.5. Video banks

If you thought that was complex, I have some bad news. It gets harder (at least I think so). The hard topic is video memory flexibility.

As we saw in the previous section, each graphics core expects graphics data at configurable memory areas (in the 06xx:xxxx range). The problem is, there is no memory at these locations (they are gaps). The programmer needs to map physical memory to these areas. Not plain memory, but dedicated video memory.

The video memory is partitioned in so-called banks, and the granularity of mapping video memory to the graphics cores is per bank. There are nine banks of video memory named VRAM_A through to VRAM_I. VRAM_A, VRAM_B, VRAM_C and VRAM_D are each 128kB, VRAM_E is 64kB, VRAM_F and VRAM_G are 16kB, VRAM_H is 32kB and VRAM_I is 16kB.
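
The bank sizes are easy to mix up, so here they are as a table in code. This is a sketch with hypothetical names, and the sizes are the ones listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes in kB of the banks VRAM_A .. VRAM_I, in that order. */
static const unsigned vram_bank_kb[9] = { 128, 128, 128, 128, 64, 16, 16, 32, 16 };

/* Size in kB of one bank, selected by its letter 'A'..'I'. */
static unsigned vram_bank_size_kb(char bank) {
    assert(bank >= 'A' && bank <= 'I');
    return vram_bank_kb[bank - 'A'];
}

/* Total amount of video memory in kB. */
static unsigned vram_total_kb(void) {
    unsigned total = 0;
    for (size_t i = 0; i < 9; i++) {
        total += vram_bank_kb[i];
    }
    return total;
}
```

The total is 656 kB, which matches the "max 656KB" of the LCDC-allocated area in the memory map.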

What makes the mapping hard is the fact that the video banks are largely non-orthogonal. For example, VRAM_A can be mapped to the main graphics core so that it can read background data from it, but VRAM_A can not be mapped to the sub graphics core. Similarly, VRAM_I can only be mapped to the sub core. Maybe surprisingly VRAM_C bank can be mapped to both cores.

To make it even more complex, the usage when mapped is not orthogonal. As mentioned in the previous paragraph, VRAM_A can be mapped to the main graphics core; it can then be used for background data, but also for sprite data. On the other hand, VRAM_C can also be mapped to the main graphics core; however it can only be used for background data, not for sprite data.

For a graphical overview, see dev-scene, a more intimidating textual description can be found at dualis. Or, check video.h.

Let's try to dissect VRAM_A (leaving out some details that can be found in the above three sources). From dualis we learn (I've simplified his table) that we can assign bank A to owner "main graphics core", and that we can map it to four different addresses: 0x6000000, 0x6020000, 0x6040000 and 0x6060000. The mapping is enabled with bit 7.

0x04000240 - VRAMCNT_A - VRAM Control A
Bit Explanation
0-2 Owner  (see below)
3-4 Offset (see below)
7   Enable (1=On, 0=Off)

Bank A is 128 kB and can be configured as follows:
  Value       Base address  Function
  000 00 000  N/A           Disabled
  100 00 001  0x6000000     Main core, BGs
  100 01 001  0x6020000     Main core, BGs
  100 10 001  0x6040000     Main core, BGs
  100 11 001  0x6060000     Main core, BGs

This coincides well with video.h. If we dissect that (again leaving out details) we find the same control register (unfortunately using a different name: VRAM_A_CR instead of VRAMCNT_A). We also find similar names and values for mapping the four addresses (and, by the way, the alias VRAM_A_MAIN_BG for VRAM_A_MAIN_BG_0x06000000).

#define VRAM_A_CR      (*(vuint8*)0x04000240)

#define VRAM_ENABLE    (1<<7)
#define VRAM_OFFSET(n) ((n)<<3)

typedef enum {
  VRAM_A_LCD  = 0,
  VRAM_A_MAIN_BG  = 1,
  VRAM_A_MAIN_BG_0x06000000 = 1 | VRAM_OFFSET(0),
  VRAM_A_MAIN_BG_0x06020000 = 1 | VRAM_OFFSET(1),
  VRAM_A_MAIN_BG_0x06040000 = 1 | VRAM_OFFSET(2),
  VRAM_A_MAIN_BG_0x06060000 = 1 | VRAM_OFFSET(3),
  // ... other owners left out
} VRAM_A_TYPE;

The result of this is that a typical program issues

VRAM_A_CR= VRAM_ENABLE | VRAM_A_MAIN_BG_0x06000000;

or

vramSetBankA( VRAM_A_MAIN_BG_0x06000000 );

which is the same because video.c contains the following function definition

void vramSetBankA( VRAM_A_TYPE a ) { VRAM_A_CR = VRAM_ENABLE | a; }
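
The byte that ends up in the control register, and the address that goes with it, can be computed as follows. This is a sketch: the function names and OWNER_MAIN_BG are hypothetical, while the bit positions and the four base addresses are the ones from the tables above.

```c
#include <assert.h>
#include <stdint.h>

#define VRAM_ENABLE    (1u << 7)
#define VRAM_OFFSET(n) ((n) << 3)
#define OWNER_MAIN_BG  1u          /* assumption: owner value 1 = main core, BGs */

/* Control byte that maps bank A to the main core at one of its four slots. */
static uint8_t vram_a_main_bg_value(unsigned slot) {
    assert(slot < 4);
    return (uint8_t)(VRAM_ENABLE | OWNER_MAIN_BG | VRAM_OFFSET(slot));
}

/* Base address at which bank A (128 kB = 0x20000) then appears. */
static uint32_t vram_a_main_bg_address(unsigned slot) {
    assert(slot < 4);
    return 0x06000000u + slot * 0x20000u;
}
```

Slot 0 gives 0x81 (binary 100 00 001) and slot 3 gives 0x99 (binary 100 11 001), the first and last mapped rows of the dualis table.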

This leads to the following map

        VRAM_A     tilebase      mapbase    allocated
______________ ____________ ____________ ____________
0x2:0000=128kB  0x4000=16kB    0x800=2kB
__ ___________ ____________ ____________ ____________
 0   0600:0000    0600:0000    0600:0000  MAP for BG0
                            ____________ ____________
                               0600:0800  MAP for BG1
                            ____________ ____________
               ____________ ____________ ____________
                  0600:4000    0600:4000 TILE for BG0
               ____________ ____________ ____________
                  0600:8000    0600:8000 TILE for BG1
               ____________ ____________ ____________
                  0600:C000    0600:C000
               ____________ ____________
                  0601:0000    0601:0000
               ____________ ____________
                  0601:4000    0601:4000
               ____________ ____________
                  0601:8000    0601:8000
               ____________ ____________
                  0601:C000    0601:C000
__ ___________ ____________ ____________
 1   0602:0000    0602:0000    0602:0000
__ ___________ ____________ ____________
 2   0604:0000    0604:0000    0604:0000
__ ___________ ____________ ____________
 3   0606:0000    0606:0000    0606:0000
______________ ____________ ____________
     0608:0000    0608:0000    0608:0000

6.6. Video demo 1: multiple backgrounds

Let's now bring theory into practice: we write a demo program with a white background on top of which we put a blue grid on top of which we put red dots in the grid.

Our program uses three backgrounds: bg0 (white background), bg1 (blue grid), and bg2 (red dots). All backgrounds are in text mode, so we can choose video modes 0, 1, and 3. Let's pick 0. By default, all three backgrounds are on. This explains the statements labeled A and B in main(): setting REG_POWERCNT and DISPLAY_CR.

Our memory map is a simple extension of the two-bg setup from the theory: tile-base 0 is not used for tile-sets, instead we use it to store all three tile-maps. Tile-bases 1, 2, and 3 are used for bg0, bg1, and bg2.

        VRAM_A     tilebase      mapbase    allocated
______________ ____________ ____________ ____________
0x2:0000=128kB  0x4000=16kB    0x800=2kB
__ ___________ __ _________ __ _________ ____________
 0   0600:0000  0 0600:0000  0 0600:0000  MAP for BG0
                            __ _________ ____________
                             1 0600:0800  MAP for BG1
                            __ _________ ____________
                             2 0600:1000  MAP for BG2
                            ____________ ____________
               __ _________ ____________ ____________
                1 0600:4000    0600:4000 TILE for BG0
               __ _________ ____________ ____________
                2 0600:8000    0600:8000 TILE for BG1
               __ _________ ____________ ____________
                3 0600:C000    0600:C000 TILE for BG2
               ____________ ____________ ____________
                  0601:0000    0601:0000
__ ___________ ____________ ____________
 1   0602:0000    0602:0000    0602:0000

Furthermore, we use 256 color tiles, and a tile-map of 32x32. For the z-order, we put bg0 at the back (priority 3), bg1 in the middle (priority 2) and bg2 at the front (priority 1). This explains the statements under C.

All data fits in VRAM_A so statement D maps that.

We use a single palette. We need colors: white (color 1) for the background, blue (color 2) for the grid and red (color 3) for the dots. Note that color 0 is used for transparent pixels! Statement E copies the array defined in 2.
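
The palette entries are 16 bit values with 5 bits per color component. A sketch of how such a value is put together, assuming the usual layout with red in bits 0-4, green in bits 5-9 and blue in bits 10-14 (rgb15 is my own lower-case stand-in for the RGB15 macro):

```c
#include <assert.h>
#include <stdint.h>

/* Pack three 5 bit components (0..31) into one 15 bit color. */
static uint16_t rgb15(unsigned r, unsigned g, unsigned b) {
    assert(r < 32 && g < 32 && b < 32);
    return (uint16_t)(r | (g << 5) | (b << 10));
}
```

So white is 0x7FFF, blue is 0x7C00 and red is 0x001F.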

Note that we use our own memcpy16() (see under 1), because memory writes to video memory do not work in chunks of 8 bits.

Statement F first copies the tile-set (of one tile) and next fills the tile-map (with references to that one tile). The tile-sets are defined under 3. Since we have 256 color tiles, the rendering in C (using a u8 array) is rather readable. The F section occurs for all three layers.

#include "nds.h"

// 1. Word based mem copy
void memcpy16( u16 *dst, u16 *src, int numbytes ) {
  int i= numbytes / 2; // number of 16 bit chunks
  while( i>0 ) { *dst= *src; dst++; src++; i--; }
}

// 2. The palette
u16 palette[]= {
  /*0*/ RGB15( 0, 0, 0), // index 0 is transparent
  /*1*/ RGB15(31,31,31), // white background
  /*2*/ RGB15( 0, 0,31), // blue for cell borders
  /*3*/ RGB15(31, 0, 0)  // red for dot
};

// 3. Three tiles (maybe three tile-sets of each one tile)
u8 tile_white[8*8] = { // completely white

u8 tile_cell[8*8] = { // blue border

u8 tile_dot[8*8] = { // red dot

int main(void) {
  int i;

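  // A. Enable both screens and both 2D cores
  REG_POWERCNT= POWER_LCD | POWER_2D_A | POWER_2D_B;

  // B. All four backgrounds are text/tiles (mode 0), but only use bg0, bg1, and bg2
  DISPLAY_CR= MODE_0_2D | DISPLAY_BG0_ACTIVE | DISPLAY_BG1_ACTIVE | DISPLAY_BG2_ACTIVE;

  // C. BGs are 32x32, 256 colors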
  BG0_CR= BG_32x32 | BG_COLOR_256 | BG_MAP_BASE(0) | BG_TILE_BASE(1) | BG_PRIORITY_3; // back
  BG1_CR= BG_32x32 | BG_COLOR_256 | BG_MAP_BASE(1) | BG_TILE_BASE(2) | BG_PRIORITY_2;
  BG2_CR= BG_32x32 | BG_COLOR_256 | BG_MAP_BASE(2) | BG_TILE_BASE(3) | BG_PRIORITY_1; // front

  // D. Allocate VRAM_A to the main core for BGs
  VRAM_A_CR= VRAM_ENABLE | VRAM_A_MAIN_BG_0x06000000;

  // E. Copying palette
  memcpy16( BG_PALETTE, palette, sizeof(palette) );

  // F1. Copy tile-set and tile-map
  memcpy16( (u16*)BG_TILE_RAM(1), (u16*)tile_white, sizeof(tile_white) );
  for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM(0))[i]= 0; // 0 is tile_white

  // F2. Copy tile-set and tile-map
  memcpy16( (u16*)BG_TILE_RAM(2), (u16*)tile_cell, sizeof(tile_cell) );
  for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM(1))[i]= 0; // 0 is tile_cell

  // F3. Copy tile-set and tile-map
  memcpy16( (u16*)BG_TILE_RAM(3), (u16*)tile_dot, sizeof(tile_dot) );
  for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM(2))[i]= 0; // 0 is tile_dot

  // Flush key buffer
  while( keysDown() ) scanKeys();

  int bg2_x= 0;
  int bg2_y= 0;
  while(1) {
    scanKeys(); // refresh the key state reported by keysDown()

    if( keysDown() & KEY_R ) REG_POWERCNT^= POWER_SWAP_LCDS; // =lcdSwap()
    if( keysDown() & KEY_L ) REG_POWERCNT^= POWER_LCD;

    if( keysDown() & KEY_X ) DISPLAY_CR^= DISPLAY_BG0_ACTIVE;
    if( keysDown() & KEY_A ) DISPLAY_CR^= DISPLAY_BG1_ACTIVE;
    if( keysDown() & KEY_B ) DISPLAY_CR^= DISPLAY_BG2_ACTIVE;

    if( keysDown() & KEY_Y ) {
      // Swap the priorities (bits 0-1) of bg1 and bg2
      int p1= BG1_CR&3;
      int p2= BG2_CR&3;
      BG1_CR= (BG1_CR&~3)|p2;
      BG2_CR= (BG2_CR&~3)|p1;
    }

    if( keysDown() & KEY_DOWN  ) { bg2_y--; BG2_Y0=bg2_y; }
    if( keysDown() & KEY_UP    ) { bg2_y++; BG2_Y0=bg2_y; }
    if( keysDown() & KEY_RIGHT ) { bg2_x--; BG2_X0=bg2_x; }
    if( keysDown() & KEY_LEFT  ) { bg2_x++; BG2_X0=bg2_x; }
  }

  return 0;
}

Note that we have added a key dispatcher to see what other registers do.
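
The Y key handler is the least obvious one, so here is that piece in isolation, as a sketch that works on two plain variables instead of the BG1_CR and BG2_CR registers (swap_priorities is a hypothetical helper name).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exchange the priority fields (bits 0-1) of two background control values. */
static void swap_priorities(uint16_t *cr_a, uint16_t *cr_b) {
    assert(cr_a != NULL && cr_b != NULL);
    uint16_t pa = (uint16_t)(*cr_a & 3u);
    uint16_t pb = (uint16_t)(*cr_b & 3u);
    *cr_a = (uint16_t)((*cr_a & ~3u) | pb);  /* keep bits 2-15, take b's priority */
    *cr_b = (uint16_t)((*cr_b & ~3u) | pa);  /* keep bits 2-15, take a's priority */
}
```

With the settings of statement C, BG1_CR holds 0x018A (priority 2) and BG2_CR holds 0x028D (priority 1). After the swap they hold 0x0189 and 0x028E, so the grid is drawn in front of the dots.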

It is also possible to put all tiles in one tileset, and use that tile-set for all three backgrounds. See bgdemo.c as an example.

6.7. Video demo 2: both screens

Controlling both screens is very similar. In this section we will use both screens, and on both screens, we will enable two layers. The 'back' layer is fully white in both cases, but the 'front' layer differs: for main we use the blue grid, and for sub we use the red dots.

The full demo can be downloaded. The highlights are shown here.

// Enable both screens and both 2D cores
REG_POWERCNT= POWER_LCD | POWER_2D_A | POWER_2D_B;

// Set up main core
BG0_CR= BG_32x32 | BG_COLOR_256 | BG_MAP_BASE(0) | BG_TILE_BASE(1) | BG_PRIORITY_3; // back
memcpy16( BG_PALETTE, palette, sizeof(palette) );
memcpy16( (u16*)BG_TILE_RAM(1), (u16*)tiles, sizeof(tiles) );
for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM(0))[i]= 0; // 0 is tile white
for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM(1))[i]= 1; // 1 is tile cell

// Set up sub core
SUB_BG0_CR= BG_32x32 | BG_COLOR_256 | BG_MAP_BASE(0) | BG_TILE_BASE(1) | BG_PRIORITY_3; // back
memcpy16( BG_PALETTE_SUB, palette, sizeof(palette) );
memcpy16( (u16*)BG_TILE_RAM_SUB(1), (u16*)tiles, sizeof(tiles) );
for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM_SUB(0))[i]= 0; // 0 is tile white
for( i=0; i<32*32; i++ ) ((u16*)BG_MAP_RAM_SUB(1))[i]= 2; // 2 is tile dot

We are lazy, so we have defined a single palette, and copy that to the appropriate location for both cores. We have defined a single tile-set, and copy that to the appropriate location for both cores. Also note that the two backgrounds of the main core share the tile-set (and this also holds for the two backgrounds on the sub core).

6.8. Video demo 3: snake

I wrote my first, rather real, nds video game: snake! It is available for download.

7. Scratch area

This is the 'todo' chapter.

7.1. Power management

Still to explain writePowerManagement (hooked to the SPI bus like some other chips... drawing)
spi bus hw doc (Serial Peripheral Interface Bus)
SPI Bus is a 4-wire (Data In, Data Out, Clock, and Chipselect) serial bus.
The NDS supports the following SPI devices (each with its own chipselect).
DS Firmware Serial Flash Memory
DS Touch Screen Controller (TSC)
DS Power Management

But devkitPro\libnds\include\nds\arm7\serial.h says
// Pick the SPI device
#define SPI_DEVICE_POWER (0 << 8)
#define SPI_DEVICE_FIRMWARE (1 << 8)
#define SPI_DEVICE_NVRAM (1 << 8)
#define SPI_DEVICE_TOUCH (2 << 8)
#define SPI_DEVICE_MICROPHONE (2 << 8)

7.2. Open Issues


Maarten Pennings
Last modified:
7 May 2010