ASCII and Unicode are two of the most widely used character encoding standards in programming. Unicode can represent a vast range of characters using code points from 0 to 0x10FFFF, whereas ASCII encodes only 128 characters in 7 bits. When a C++ program needs to process or display characters beyond the ASCII range, it helps to know how ASCII character codes relate to Unicode code points. This article presents a simple C++ program that converts user-entered ASCII codes into the corresponding Unicode characters. The strategy is to map ASCII values directly to Unicode code points, which works for the standard ASCII range of 0-127. The code shows how concise this conversion is in C++ and serves as a first step toward more sophisticated Unicode handling in software applications.
What is the ASCII Code?
ASCII (American Standard Code for Information Interchange) is a character encoding standard that uses seven bits to represent 128 characters. Developed in the 1960s, it was designed around the English alphabet.
The character set in ASCII encodes:
- Both uppercase and lowercase English letters (A-Z, a-z).
- Numbers 0 through 9.
- Symbols for punctuation.
- Control codes: line feed, carriage return, etc.
- Special symbols such as !"#$%&'*+,-./:;<=>?@^_{|}~.
The binary numbers ranging from 0000000 to 1111111 represent decimal values from 0 to 127, mapping to individual characters. For instance:
- Binary 1000001, equivalent to decimal 65, represents the character 'A',
- Whereas binary 1000010, equal to decimal 66, represents the character 'B'.
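The relationship between a character, its decimal code, and its 7-bit pattern can be checked with a short sketch. The helper name `to_binary7` is hypothetical, and the sketch assumes the compiler's execution character set is ASCII-compatible, which is the case on mainstream platforms.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: returns the 7-bit binary pattern of an ASCII code.
// Assumes 0 <= code <= 127; any higher bits are dropped by the 7-bit bitset.
std::string to_binary7(int code) {
    return std::bitset<7>(static_cast<unsigned long long>(code)).to_string();
}
```

Calling `to_binary7('A')` yields `1000001` and `to_binary7(66)` yields `1000010`.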
The first 32 ASCII codes (0-31 decimal) are set aside for non-printable control characters such as null, tab, line feed, and carriage return. Codes 32-126 are the printable characters (the space, letters, digits, and punctuation), while code 127 is reserved for the delete (DEL) control character.
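These ranges translate directly into a few comparisons. The following is a minimal sketch; the function name `ascii_category` and the category labels are illustrative choices, not part of any standard API.

```cpp
#include <string>

// Hypothetical helper: names the ASCII category a decimal code falls into.
std::string ascii_category(int code) {
    if (code < 0 || code > 127) return "not ASCII";   // outside the 7-bit range
    if (code <= 31) return "control";                 // codes 0-31
    if (code == 127) return "control (DEL)";          // the delete character
    return "printable";                               // codes 32-126
}
```

For example, `ascii_category(10)` returns `control` for the line feed, and `ascii_category(65)` returns `printable`.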
ASCII needs only 7 bits per character, but contemporary systems store each character in an 8-bit byte with the most significant bit set to 0. This leaves the byte values 128-255 free for other uses, which is what lets ASCII coexist with other encodings in 8-bit environments.
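Because an ASCII code stored in a byte always has its top bit clear, a single mask is enough to tell whether a byte can be plain ASCII. This is a small sketch with a hypothetical helper name, `is_ascii_byte`.

```cpp
#include <cstdint>

// Hypothetical helper: true when an 8-bit byte holds a 7-bit ASCII code,
// i.e. when its most significant bit (0x80) is 0.
bool is_ascii_byte(std::uint8_t byte) {
    return (byte & 0x80u) == 0;
}
```

Bytes such as 0x41 ('A') pass the check, while any value from 0x80 upward belongs to some other encoding.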
What is Unicode?
Unicode is the widely adopted computing standard for the consistent encoding, representation, and handling of text in most of the world's writing systems. It assigns a unique number, called a code point, to every character, regardless of platform, program, or language.
Some key points about Unicode:
- Unicode enables text processing, storage, and transport independently of language and platform.
- The Unicode standard can encode over 1 million characters. It includes characters of all major languages in the world.
- Unicode uses a 21-bit code space running from U+0000 to U+10FFFF, which gives 1,114,112 code points. Excluding the 2,048 surrogate code points leaves 1,112,064 values that can be assigned to characters; many of them are still unassigned.
- The code space is divided into 17 planes, each with 65,536 (= 2^16) code points. The first plane (U+0000 to U+FFFF) is called the Basic Multilingual Plane (BMP) and contains characters for almost all modern languages.
- Beyond code points, Unicode defines algorithms and guidelines for bidirectional text, collation, and rendering to facilitate internationalization.
- The Unicode Consortium, a non-profit organization, maintains the Unicode standard. Major companies and organizations participate in developing Unicode standards.
- Unicode is device- and platform-independent. A given code point identifies the same character on every device, although its visual appearance depends on the font in use.
- Unicode is backwards compatible with ASCII. The first 128 Unicode code points correspond to the ASCII characters.
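The size of the code space and the plane layout can be verified with compile-time arithmetic. This is only an illustrative sketch: the constant and function names (`kMaxCodePoint`, `kPlaneSize`, `plane_of`, `is_surrogate`) are hypothetical and assume a C++11 or later compiler.

```cpp
#include <cstdint>

constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;  // highest Unicode code point
constexpr std::uint32_t kPlaneSize    = 0x10000;   // 65,536 code points per plane

// Hypothetical helper: plane number (0-16) of a code point; plane 0 is the BMP.
constexpr std::uint32_t plane_of(std::uint32_t cp) {
    return cp >> 16;
}

// Hypothetical helper: surrogates U+D800..U+DFFF are code points but never characters.
constexpr bool is_surrogate(std::uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// 17 planes * 65,536 = 1,114,112 code points in total.
static_assert((kMaxCodePoint + 1) / kPlaneSize == 17, "17 planes");
static_assert(kMaxCodePoint + 1 == 1114112, "size of the code space");
// Removing the 2,048 surrogates leaves 1,112,064 values usable for characters.
static_assert(kMaxCodePoint + 1 - 2048 == 1112064, "values usable for characters");
```

Every ASCII code lands in plane 0, so `plane_of(0x41)` is 0, while a code point such as U+1F600 lies in plane 1.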
What is the ASCII Table of Characters?
The ASCII table lists the 128 characters of the ASCII (American Standard Code for Information Interchange) encoding together with their 7-bit code values.
The ASCII table includes:
- Uppercase and lowercase English letters
- Numeric digits
- Punctuation marks
- Control codes
- Special characters
Every ASCII character corresponds to a decimal value ranging from 0 to 127. This mapping enables the representation of characters through binary digits spanning from 0000000 to 1111111.
The first 32 ASCII codes (0-31) are reserved for non-printable control function characters like null, tab, line feed, carriage return, etc.
- Codes 32 to 47 represent the space and various punctuation symbols.
- Codes 48 to 57 represent the numeric digits 0 to 9.
- Codes 65 to 90 are the uppercase letters A to Z.
- Codes 97 to 122 are the lowercase letters a to z.
The remaining codes are utilized for extra symbols and control characters. Here is the complete ASCII standard chart displaying each character linked to its decimal and hexadecimal code value:
| Decimal | Hex | Character |
|---|---|---|
| 0 | 00 | NUL (null) |
| 1 | 01 | SOH (start of heading) |
| 2 | 02 | STX (start of text) |
| 3 | 03 | ETX (end of text) |
| 4 | 04 | EOT (end of transmission) |
| 5 | 05 | ENQ (enquiry) |
| 6 | 06 | ACK (acknowledge) |
| 7 | 07 | BEL (bell) |
| 8 | 08 | BS (backspace) |
| 9 | 09 | TAB (horizontal tab) |
| 10 | 0A | LF (newline) |
| 11 | 0B | VT (vertical tab) |
| 12 | 0C | FF (form feed) |
| 13 | 0D | CR (carriage return) |
| 14 | 0E | SO (shift out) |
| 15 | 0F | SI (shift in) |
| 16 | 10 | DLE (data link escape) |
| 17 | 11 | DC1 (device control 1) |
| 18 | 12 | DC2 (device control 2) |
| 19 | 13 | DC3 (device control 3) |
| 20 | 14 | DC4 (device control 4) |
| 21 | 15 | NAK (negative acknowledge) |
| 22 | 16 | SYN (synchronous idle) |
| 23 | 17 | ETB (end of transmission block) |
| 24 | 18 | CAN (cancel) |
| 25 | 19 | EM (end of medium) |
| 26 | 1A | SUB (substitute) |
| 27 | 1B | ESC (escape) |
| 28 | 1C | FS (file separator) |
| 29 | 1D | GS (group separator) |
| 30 | 1E | RS (record separator) |
| 31 | 1F | US (unit separator) |
| 32 | 20 | (space) |
| 33 | 21 | ! |
| 34 | 22 | " |
| 35 | 23 | # |
| 36 | 24 | $ |
| 37 | 25 | % |
| 38 | 26 | & |
| 39 | 27 | ' |
| 40 | 28 | ( |
| 41 | 29 | ) |
| 42 | 2A | * |
| 43 | 2B | + |
| 44 | 2C | , |
| 45 | 2D | - |
| 46 | 2E | . |
| 47 | 2F | / |
| 48 | 30 | 0 |
| 49 | 31 | 1 |
| 50 | 32 | 2 |
| 51 | 33 | 3 |
| 52 | 34 | 4 |
| 53 | 35 | 5 |
| 54 | 36 | 6 |
| 55 | 37 | 7 |
| 56 | 38 | 8 |
| 57 | 39 | 9 |
| 58 | 3A | : |
| 59 | 3B | ; |
| 60 | 3C | < |
| 61 | 3D | = |
| 62 | 3E | > |
| 63 | 3F | ? |
| 64 | 40 | @ |
| 65 | 41 | A |
| 66 | 42 | B |
| 67 | 43 | C |
| 68 | 44 | D |
| 69 | 45 | E |
| 70 | 46 | F |
| 71 | 47 | G |
| 72 | 48 | H |
| 73 | 49 | I |
| 74 | 4A | J |
| 75 | 4B | K |
| 76 | 4C | L |
| 77 | 4D | M |
| 78 | 4E | N |
| 79 | 4F | O |
| 80 | 50 | P |
| 81 | 51 | Q |
| 82 | 52 | R |
| 83 | 53 | S |
| 84 | 54 | T |
| 85 | 55 | U |
| 86 | 56 | V |
| 87 | 57 | W |
| 88 | 58 | X |
| 89 | 59 | Y |
| 90 | 5A | Z |
| 91 | 5B | [ |
| 92 | 5C | \ |
| 93 | 5D | ] |
| 94 | 5E | ^ |
| 95 | 5F | _ |
| 96 | 60 | ` |
| 97 | 61 | a |
| 98 | 62 | b |
| 99 | 63 | c |
| 100 | 64 | d |
| 101 | 65 | e |
| 102 | 66 | f |
| 103 | 67 | g |
| 104 | 68 | h |
| 105 | 69 | i |
| 106 | 6A | j |
| 107 | 6B | k |
| 108 | 6C | l |
| 109 | 6D | m |
| 110 | 6E | n |
| 111 | 6F | o |
| 112 | 70 | p |
| 113 | 71 | q |
| 114 | 72 | r |
| 115 | 73 | s |
| 116 | 74 | t |
| 117 | 75 | u |
| 118 | 76 | v |
| 119 | 77 | w |
| 120 | 78 | x |
| 121 | 79 | y |
| 122 | 7A | z |
| 123 | 7B | { |
| 124 | 7C | \| |
| 125 | 7D | } |
| 126 | 7E | ~ |
| 127 | 7F | DEL (delete) |
The table covers the full 128-character ASCII set: control codes, printable characters, punctuation, and special symbols. For each character it gives the decimal and hexadecimal value used in the ASCII encoding standard.
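A row of such a chart can also be produced programmatically. The sketch below formats a single code as decimal, two-digit uppercase hex, and character; the function name `ascii_row` and the `(non-printable)` placeholder are illustrative choices.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats one chart row as "decimal | hex | character".
// Control codes and DEL have no glyph, so a placeholder is shown instead.
std::string ascii_row(int code) {
    std::ostringstream out;
    out << std::dec << code << " | "
        << std::uppercase << std::hex << std::setw(2) << std::setfill('0') << code
        << " | ";
    if (code >= 32 && code <= 126) {
        out << static_cast<char>(code);   // printable range
    } else {
        out << "(non-printable)";
    }
    return out.str();
}
```

Calling `ascii_row(65)` gives `65 | 41 | A`, and looping over 0 to 127 reproduces the decimal and hex columns of the chart.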
C++ Implementation
- Get the decimal value of the ASCII character that needs to be converted. For example, 'A' has a decimal value of 65.
- For ASCII values between 0 and 127, simply assign the ASCII decimal value directly to the Unicode code point. It works because Unicode is backwards compatible with ASCII and maintains the same values for the first 128 characters.
- So for 'A' with ASCII value 65, the equivalent Unicode code point value is also 65.
- To turn this value into an actual character, cast the integer code point to a char or wchar_t type in C++.
For example:
int unicode = 65;
wchar_t unicodeChar = (wchar_t)unicode; // unicodeChar contains 'A'
- The assignment stores the ASCII value unchanged, and the cast tells the compiler to treat that number as a character with the same code point.
- Values above 127 are not part of ASCII. They belong to 8-bit "extended ASCII" code pages, so mapping them to the appropriate Unicode code point requires a lookup table or switch statement for the specific code page in use.
- Library functions such as mbstowcs (C standard library) or MultiByteToWideChar (Windows API) can also convert narrow, ASCII-compatible strings to wide characters.
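As a rough illustration of the library route, the sketch below widens a narrow string with the standard `std::mbstowcs`. The wrapper name `widen_ascii` is hypothetical, and the sketch assumes the program runs in the default "C" locale with pure ASCII input; for other encodings the result depends on the active locale.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: widens a narrow string using the standard mbstowcs.
// Assumption: default "C" locale and ASCII-only input, so each byte becomes
// one wide character with the same value.
std::wstring widen_ascii(const std::string& narrow) {
    std::wstring wide(narrow.size(), L'\0');
    std::size_t written = std::mbstowcs(&wide[0], narrow.c_str(), wide.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::wstring();  // invalid multibyte sequence for this locale
    }
    wide.resize(written);       // keep only the characters actually converted
    return wide;
}
```

With these assumptions, `widen_ascii("ABC")` produces the wide string `L"ABC"`.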
So, to recap, when dealing with the ASCII range of 0-127, you can map the ASCII decimal value directly to Unicode. For extended characters above 127, you need a mapping for the specific code page to determine the corresponding Unicode code point. You can then cast the resulting integer code point to wchar_t (or to char for values 0-127) to obtain the character.
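#include <iostream>

int main() {
    // Use the wide stream for all output; mixing cout and wcout on stdout is unreliable
    std::wcout << L"Enter an ASCII code (0-127): ";
    int asciiCode;
    // Reject non-numeric input and values outside the 7-bit ASCII range
    if (!(std::cin >> asciiCode) || asciiCode < 0 || asciiCode > 127) {
        std::wcerr << L"Invalid ASCII code." << std::endl;
        return 1;
    }
    // Convert ASCII to Unicode: code points 0-127 are identical in both
    int unicode = asciiCode;
    // Print equivalent Unicode character
    std::wcout << L"Unicode character: " << static_cast<wchar_t>(unicode) << std::endl;
    return 0;
}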
Output:
Enter an ASCII code (0-127): 65
Unicode character: A
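For values above 127, the lookup-table approach might look like the sketch below. It assumes, purely for illustration, that the input bytes are encoded in Windows-1252; the function name `cp1252_to_unicode` is hypothetical, and only a handful of the 0x80-0x9F entries are listed, so this is a partial table rather than a complete converter.

```cpp
#include <cstdint>

// Hypothetical helper: maps one byte to a Unicode code point, ASSUMING the
// byte is encoded in Windows-1252. Only a few 0x80-0x9F entries are listed.
// Returns 0xFFFD (the replacement character) for bytes this sketch omits.
char32_t cp1252_to_unicode(std::uint8_t byte) {
    if (byte < 0x80) return byte;    // ASCII range: identical code points
    if (byte >= 0xA0) return byte;   // 0xA0-0xFF match U+00A0-U+00FF
    switch (byte) {                  // partial table for 0x80-0x9F
        case 0x80: return 0x20AC;    // euro sign
        case 0x85: return 0x2026;    // horizontal ellipsis
        case 0x91: return 0x2018;    // left single quotation mark
        case 0x92: return 0x2019;    // right single quotation mark
        case 0x93: return 0x201C;    // left double quotation mark
        case 0x94: return 0x201D;    // right double quotation mark
        case 0x99: return 0x2122;    // trade mark sign
        default:   return 0xFFFD;    // not covered by this sketch
    }
}
```

A different code page would need its own table, which is why the direct cast used in the program above is only safe for codes 0-127.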