May 5, 2008 1:35 PM PDT

Google: Unicode conquers ASCII on the Web

I picture it happening this way. The Roman alphabet is on the run, pursued by a much larger army of Arabic characters with long scimitar-like ligatures, Chinese characters that look like throwing stars, and European peasant letters bristling with umlauts, cedillas, and tildes.

Unicode now is the most common character encoding method on the Web.

(Credit: Google)

Unicode has overtaken ASCII as the most popular character encoding scheme on the World Wide Web, Mark Davis, Google's senior international software architect, said in a blog post. Also vanquished at almost exactly the same time was the Western European encoding.

Unicode is a character encoding standard that gracefully accommodates dozens of languages as well as Roman characters with diacritical marks. ASCII, a tried-and-true, decades-old standard, is limited to 128 characters (256 in its common 8-bit extensions) and has a hard time extending beyond the range of a century-old Remington typewriter.

Unicode vanquished ASCII and Western European within 10 days in December, Davis said.

"What's more impressive than simply overtaking them is the speed with which this happened," he added, pointing to a graph showing the meteoric rise of Unicode.

Google's a fan of Unicode Web sites. When it processes data from Web sites, it converts it into Unicode first if it's not already there. That improves international search abilities.
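To make the "convert to Unicode first" step concrete, here is a minimal sketch of transcoding a legacy-encoded page to UTF-8. It is not Google's pipeline, which the post does not describe; the function name `to_utf8` and the assumption that the page's declared encoding is known and correct are purely illustrative.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding, then re-encode them as UTF-8.

    Hypothetical helper: assumes the declared encoding is accurate.
    """
    text = raw.decode(declared_encoding)  # bytes -> Unicode code points
    return text.encode("utf-8")           # code points -> UTF-8 bytes


# "café" as a Western European (ISO-8859-1) page would store it: 4 bytes.
latin1_bytes = b"caf\xe9"
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
print(len(latin1_bytes), len(utf8_bytes))  # 4 5
```

Once everything is in one encoding, downstream text processing no longer has to special-case each legacy character set.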

"The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover," he said.

Google just converted to Unicode 5.1, he added, "so people speaking languages such as Malayalam can now search for words containing the new characters."

One disadvantage Unicode has over ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character because Unicode uses more bytes to enumerate its vastly larger range of alphabetic symbols.

5 comments
It doesn't always take up more storage
by chriswaco May 5, 2008 4:24 PM PDT
The story says "One disadvantage Unicode has over ASCII,
though, is that it takes at least twice as much memory to store a
Roman alphabet character".

That's not really true with UTF-8. For unaccented Western/Roman
characters, UTF-8 takes up exactly one byte per character just
like ASCII. When you get into accent marks and non-Roman
character sets, though, UTF-8 takes two or more bytes
per character.

See:
http://en.wikipedia.org/wiki/UTF-8
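The byte counts in this comment are easy to check. The following is a small sketch, with sample characters chosen only for illustration and a hypothetical helper named `utf8_length`, that counts the bytes UTF-8 needs per character.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


# Illustrative samples, written as escapes so the source file stays ASCII.
samples = {
    "A": "unaccented Latin letter (ASCII range)",
    "\u00e9": "Latin e with acute accent",
    "\u0436": "Cyrillic letter zhe",
    "\u4e2d": "Chinese character",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), {description}")
```

The ASCII-range letter comes out at one byte, the accented and Cyrillic letters at two, the Chinese character at three, and the emoji at four.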
Most unicode content on the web is encoded in utf-8
by JasonTrue May 5, 2008 5:06 PM PDT
UTF-8 doesn't take dramatically more space than ASCII or ISO-8859-1 encodings, except for East Asian languages and certain European characters, which can take up to 50% more space than the 16-bit encoding for Unicode.

In Windows programs, text is typically represented as UTF-16 internally, which does take up more space, but generally behaves faster, since the Windows APIs are natively UTF-16.

The older single-byte/double-byte API equivalents are quietly converted to Unicode on each call, which can slow programs down a bit if they are particularly text-heavy.
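The size trade-off between UTF-8 and UTF-16 can be sketched in a few lines. The helper name `encoded_sizes` and the two sample strings are assumptions for illustration; `utf-16-le` is used so the two-byte byte-order mark is not counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of text under UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # Little-endian variant, so no byte-order mark is prepended.
        "utf-16": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("hello"))               # {'utf-8': 5, 'utf-16': 10}
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # {'utf-8': 9, 'utf-16': 6}
```

For plain English text, UTF-16 doubles the size. For the three Japanese characters in the second sample, UTF-8 needs 9 bytes against 6 for UTF-16, which is the 50% overhead mentioned above.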
UTF-8
by RussHolsclaw May 6, 2008 7:50 AM PDT
As mentioned by others, the UTF-8 format of Unicode encoding significantly reduces the overhead of Unicode, in most cases, because the character codes that correspond to the base ASCII character set are identical to ASCII itself: one byte per character. For others, the overhead is not too great. This is especially true when compared to the typical HTML/XML method of encoding non-ASCII characters by the use of "character entity" sequences, which allow non-ASCII characters to be included on a web page. These sequences are all much longer than the equivalent UTF-8 encoding of the same characters. UTF-8 also permits a single web page to contain text in multiple languages at the same time.
Also, since web pages consist largely of HTML tags and client-side scripts, which are made up of pure ASCII characters, these take up no more space than if the page were ordinary ASCII or some ISO ASCII extension set.
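The comparison with character entities can be checked with Python's standard `html` module. This is only a sketch: the helper name `entity_vs_utf8` and the sample entities are chosen for illustration.

```python
import html


def entity_vs_utf8(entity: str) -> tuple:
    """Return (bytes used by the entity, bytes used by the same character in UTF-8)."""
    character = html.unescape(entity)  # e.g. "&eacute;" -> the single character U+00E9
    return len(entity.encode("ascii")), len(character.encode("utf-8"))


print(entity_vs_utf8("&eacute;"))  # (8, 2)
print(entity_vs_utf8("&#x4E2D;"))  # (8, 3)
print(entity_vs_utf8("&#20013;"))  # (8, 3)
```

In each of these samples the entity costs eight bytes, while the same character written directly in UTF-8 costs two or three.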
by krosavcheg May 9, 2008 2:18 AM PDT
1) The "meteoric rise" of unicode is indisputable, but the graph is misleading. 75% of the web is still not unicode. Since the family of unicode text encodings aims to replace all other encodings, the graph really should have only 2 lines, "unicode encodings" and "other encodings".

2) As other commenters remarked, the overhead of unicode encodings is minimal. Overhead should never be an argument against using a unicode encoding. Anyone who has to deal with multiple text encodings in organically evolved (i.e. not carefully designed) IT systems will agree.

wcoenen (logged in with bugmenot.com)

About Underexposed

This blog sheds light on digital photography, science, and open-source software. Shankland joined CNET News in 1998, after a five-year stint as a science writer. He's a lab rat who grew up in Los Alamos, N.M., and graduated from Harvard.

Contact Stephen at Stephen.Shankland@cnet.com
