Character Encoding

During my time at Datadobi, I have had the opportunity to work with customers running some of the most complex filesystem deployments in the world. With many of these environments approaching petabyte scale, the migration discussions tied to them are equally complex, especially when migrating across different storage platforms and vendors. It’s not uncommon for a filesystem hardware tech refresh to reveal many years of poor filesystem practices, which often leave behind very complicated datasets. In this blog entry, we will focus on character encoding and the impact it can have on your data migrations.

With that said, let’s take a step back in time…

So, for most of computing history there was no single character set covering all the characters in use around the world. I find this fascinating, especially when you think about how many people are on the internet, how many languages there are around the world, and how many thousands of characters there are in a language like Chinese. Instead, a series of standards was defined over the years, for example the ISO-8859 family. The problem with these standards is that each region had its own character set, supporting a different set of characters and mapping those characters to different numbers. For example, the Western European standard (ISO-8859-1) maps the value 216 to the Ø character used in Danish and Norwegian. The Central European standard (ISO-8859-2), on the other hand, maps the same value of 216 to the Ř character. It does not take long to see how this can get confusing.
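To make the ISO-8859 example concrete, here is a minimal Python sketch that decodes the same byte value, 216, under both standards. It relies only on Python's built-in codecs; the variable names are purely illustrative.

```python
# The same byte value decodes to a different character
# depending on which character set the reader assumes.
raw = bytes([216])  # the single byte 0xD8

western = raw.decode("iso-8859-1")  # Western European (Latin-1)
central = raw.decode("iso-8859-2")  # Central European (Latin-2)

print(western)  # Ø
print(central)  # Ř
```

Nothing in the byte itself says which interpretation is correct. That information has to come from somewhere else, such as a client setting or a configuration on the storage system.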

Now, let’s take the discussion a bit further. When a computer system creates files and directories, the filesystem stores each name as a sequence of bytes produced by a particular character encoding. On network storage systems, it is extremely important that file and directory names are stored using the encoding the filesystem expects. These systems typically also need to convert file and directory names between encodings, depending on the client system and the protocol being used to access the data.
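As a rough illustration of why this matters, the sketch below encodes one hypothetical file name, `Ř.txt`, in three encodings and then reads the ISO-8859-2 bytes back with the wrong assumptions. The file name and variable names are made up for the example.

```python
name = "Ř.txt"  # hypothetical file name

# One name, three different byte sequences on the wire or on disk.
as_latin2 = name.encode("iso-8859-2")  # b'\xd8.txt'
as_utf8 = name.encode("utf-8")         # b'\xc5\x98.txt'
as_utf16 = name.encode("utf-16-le")    # two bytes per character here

# Reading the ISO-8859-2 bytes as ISO-8859-1 silently yields a different name.
misread = as_latin2.decode("iso-8859-1")  # 'Ø.txt'

# Reading them as UTF-8 fails outright, because 0xD8 followed by '.'
# is not a valid UTF-8 sequence.
try:
    as_latin2.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The silent misread is arguably the more dangerous of the two outcomes: nothing fails, but the name a user sees no longer matches the name that was originally written.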

This is evident in the behavior of an NFSv4 client, which always expects UTF-8, and of an SMBv2 client, which always expects UTF-16. I recently had the opportunity to participate in a proof of concept (POC) with a large financial services customer whose datasets came from a number of different clients, with no standard encoding parameters implemented in the environment. A scenario like this can significantly complicate a migration. The POC allowed us to demonstrate that we could identify the character-encoding issues and resolve them using functionality included in our migration software.
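To give a feel for what identifying encoding issues can look like, here is a minimal Python sketch that flags raw file names that are not valid UTF-8 and decodes them with a fallback encoding instead. This is an illustration only, not how DobiMigrate is implemented; `decode_name`, `find_non_utf8`, and the sample names are all hypothetical.

```python
from typing import Iterable, List, Tuple


def decode_name(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, bool]:
    """Decode a raw file name as UTF-8, else with the fallback encoding.

    Returns the decoded name and a flag telling whether the fallback was used.
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        return raw.decode(fallback), True


def find_non_utf8(names: Iterable[bytes]) -> List[bytes]:
    """Return the raw names that could not be decoded as UTF-8."""
    return [raw for raw in names if decode_name(raw)[1]]


# Hypothetical names as they might have been written by different clients.
sample = [b"report.txt", "Ř.txt".encode("utf-8"), b"\xd8.txt"]

suspects = find_non_utf8(sample)
recovered, used_fallback = decode_name(b"\xd8.txt", fallback="iso-8859-2")
```

On a real system, raw names could be gathered by passing a bytes path to `os.listdir`, which then returns names as bytes. Note that a name can be valid UTF-8 by coincidence and still have been written in another encoding, so a check like this catches only the obvious cases, and choosing the right fallback still depends on knowing which clients wrote the data.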

At Datadobi, we perform unstructured data migrations in the largest, most complicated filesystem environments. When it comes to character-encoding issues, it is best practice to use the migration itself as an opportunity to clean up the affected datasets, preventing both migration errors and potential data access issues afterwards. DobiMigrate is an enterprise-ready, purpose-built migration tool with advanced features that allow you to dictate character-encoding parameters at the migration path level, as well as configure a fallback encoding at the proxy layer (data mover). Features like these allow our customers to overcome complicated encoding issues during their filesystem migrations and clean up legacy datasets.

Feel free to reach out to our sales team with any questions.