Unicode System In Java

Types of Encoding

The following encodings were in use before the Unicode system became widespread:

  • ASCII (American Standard Code for Information Interchange): used for English in the United States
  • ISO 8859-1: used for Western European languages
  • KOI-8: used for Russian
  • GB18030 and BIG-5: used for Chinese
  • Base64: used for binary-to-text encoding (a way to carry binary data as ASCII text, rather than a character set)

Why does Java use the Unicode System?

The encodings used before Unicode had several limitations. Each language or region had its own encoding, so the same numeric code could stand for different letters in different encodings, and text written in one encoding could be misread in another. Encodings for languages with large character sets also used codes of varying length: some characters fit in a single byte, while others needed two or more bytes.

These limitations led to the development of a better character encoding solution: the Unicode System.
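
To make the ambiguity concrete, the sketch below decodes the same single byte under two legacy encodings. It is a minimal illustration rather than part of the original examples: the class name LegacyEncodingDemo and the helper decode are made up for this sketch, and KOI8-R is assumed to be available in the running JDK (it is not one of the charsets Java guarantees, so the code checks first).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LegacyEncodingDemo {

    // Decodes a single byte with the given charset and returns the resulting text.
    static String decode(byte value, Charset charset) {
        return new String(new byte[] { value }, charset);
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xE4; // one byte, with no encoding information attached

        // ISO 8859-1 is guaranteed to be available on every Java platform.
        System.out.println("ISO-8859-1: " + decode(raw, StandardCharsets.ISO_8859_1));

        // KOI8-R is not a guaranteed charset (assumption), so check before using it.
        if (Charset.isSupported("KOI8-R")) {
            System.out.println("KOI8-R:     " + decode(raw, Charset.forName("KOI8-R")));
        }
    }
}
```

Under ISO 8859-1 the byte 0xE4 decodes to ä (U+00E4); under KOI8-R it decodes to Д (U+0414). Without knowing which encoding produced the byte, a program cannot tell which letter was meant.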

What is the Unicode System?

  • The Unicode System is an international character encoding standard that can represent the characters of most of the world's written languages.
  • The Unicode standard is maintained by the Unicode Consortium.
  • Each character is assigned a numeric code point, conventionally written in hexadecimal (for example, U+0041 for A).
  • There are multiple Unicode Transformation Formats:
      • UTF-8: a variable-length encoding built on 8-bit code units; each character takes 1 to 4 bytes.
      • UTF-16: a variable-length encoding built on 16-bit code units; each character takes 2 or 4 bytes.
      • UTF-32: a fixed-length encoding; every character takes 32 bits (4 bytes).
  • In Java source code, a Unicode character can be written as an escape sequence: \u followed by a 4-digit hexadecimal value.
  • A single \u escape covers the range \u0000 to \uFFFF. Characters above that range are written as two escapes (a surrogate pair).
  • Some examples of Unicode escapes:
      • \u00A9 represents the copyright symbol: ©
      • \u0394 represents the capital Greek letter delta: Δ
      • \u0022 represents a double quote: "
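
How many bytes a character occupies depends on which transformation format is used. The following sketch compares UTF-8 and UTF-16 for a few sample characters. The class name EncodingLengthDemo and the two helper methods are illustrative names chosen here; only charsets from java.nio.charset.StandardCharsets are used, and UTF-16BE is picked so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengthDemo {

    // Number of bytes needed to encode the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode the text in UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, Latin capital letter A
            "\u00A9",       // U+00A9, copyright sign
            "\u20AC",       // U+20AC, euro sign
            "\uD83D\uDE00"  // U+1F600, above the 16-bit range (surrogate pair)
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d byte(s)  UTF-16: %d byte(s)%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The four samples need 1, 2, 3, and 4 bytes in UTF-8, and 2, 2, 2, and 4 bytes in UTF-16, which is why neither format can be described by a single fixed width.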

Java Unicode Example

Example

    public class Main {
        public static void main(String[] args) {
            // Unicode characters written as escape sequences
            char a = '\u0041';
            char b = '\u0042';

            // Printing the Unicode characters
            System.out.println("a = " + a);
            System.out.println("b = " + b);
        }
    }

Output:

a = A
b = B

Program to convert UTF-8 to Unicode

UnicodeDemo.java

Example

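import java.nio.charset.StandardCharsets;

public class UnicodeDemo {
    public static void main(String[] args) {
        // The escape at the end of the literal is the copyright sign
        String str1 = "Unicode System\u00A9";

        // Encode the string as UTF-8 bytes
        byte[] utf8Bytes = str1.getBytes(StandardCharsets.UTF_8);

        // Decode the UTF-8 bytes back into a String
        String newstr = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(newstr);
    }
}

Output:

Unicode System©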

In the code above, a class named UnicodeDemo is declared. First, the string str1 is encoded into UTF-8 bytes using the getBytes method. The byte array is then decoded back into a String, and the result, stored in newstr, is printed to the console.
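
Printing the decoded string hides what the intermediate byte array actually contains. The sketch below prints the UTF-8 bytes of the copyright sign in hexadecimal and then checks that decoding restores the original string. The class name Utf8BytesDemo and the helper toHex are illustrative names, not part of the program above.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BytesDemo {

    // Formats each byte as two uppercase hexadecimal digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00A9"; // copyright sign

        // Encode: String -> UTF-8 bytes
        byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(encoded)); // C2 A9

        // Decode: UTF-8 bytes -> String
        String decoded = new String(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original)); // true
    }
}
```

The single character becomes two bytes, C2 A9, in UTF-8. Passing StandardCharsets.UTF_8 instead of the charset name as a string means no checked UnsupportedEncodingException has to be declared or caught.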

Problem Caused by Unicode

Unicode was originally designed as a 16-bit character encoding, and Java's char primitive data type was defined as 16 bits so that it could hold any Unicode character. However, 16 bits can represent only 65,536 distinct values, which proved insufficient to cover all of the world's characters.

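The Unicode standard was therefore extended to accommodate 1,112,064 characters (the code points U+0000 to U+10FFFF, excluding the surrogate range). Characters above U+FFFF, called supplementary characters, do not fit in a single 16-bit char; Java represents each of them as a pair of char values known as a surrogate pair.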
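
The following sketch shows how a supplementary character looks from Java code, and where the figure 1,112,064 comes from. It uses only methods and constants of java.lang.Character and java.lang.String; the class name SupplementaryDemo, the helper encodableCodePoints, and the choice of U+1F600 as the sample character are illustrative.

```java
public class SupplementaryDemo {

    // Count of usable code points: every value from 0 to Character.MAX_CODE_POINT,
    // minus the surrogate values that are reserved for UTF-16 pairs.
    static int encodableCodePoints() {
        int all = Character.MAX_CODE_POINT + 1;                                 // 0x110000
        int surrogates = Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1; // 2,048
        return all - surrogates;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // grinning face emoji, above U+FFFF

        // Character.toChars returns the UTF-16 form: two char values here.
        char[] units = Character.toChars(codePoint);
        String text = new String(units);

        System.out.println(units.length);                          // 2
        System.out.println(Character.isHighSurrogate(units[0]));   // true
        System.out.println(Character.isLowSurrogate(units[1]));    // true
        System.out.println(text.length());                         // 2 (char values)
        System.out.println(text.codePointCount(0, text.length())); // 1 (character)
        System.out.println(text.codePointAt(0) == codePoint);      // true
        System.out.println(encodableCodePoints());                 // 1112064
    }
}
```

Because length() counts char values rather than characters, code that must handle supplementary characters correctly generally works with code points (codePointAt, codePointCount) instead of indexing individual char values.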

This article covered the encodings used before Unicode, the Unicode System in Java, the limitation of the original 16-bit design, and Java programs that demonstrate the Unicode system.
