PGTS G. Patterson.   T/A PGTS ABN: 99885392845


PGTS Humble Blog

Thread: Internet Standards & Competition

Gerry Patterson. The world's most humble blogger
And if your head explodes with dark forebodings too -- I'll see you on the dark side of the moon!

Translating UTF-16 to 7-bit ASCII



Date: Sun, 17 Aug 2025 18:20:54 +1000

UTF-16 has been widely adopted by Microsoft. If you mostly work with 7-bit ASCII in Linux, you'll have to come up with ways to deal with it.

Earlier this year, I asked Chat GPT to write a C program that would translate UTF-16 to 7-bit ASCII. I didn't really need to do this, because I already had processes [on the Windows side] that would handle it. However, I saw this as an opportunity to do some research into using AI to generate code. The code that was delivered would have worked, but I was somewhat underwhelmed by the quality.

A couple of months later, I repeated the experiment and the resulting code looked much better. It required some attention ... Adding options, moving some of the code into subroutines and adding a subroutine to trim white space ... And there was also a bug ... The original code did not handle the byte order mark (BOM), the two-byte header word at the start of MS UTF-16 data files. But overall, the fact that the bulk of it had been created by Chat GPT saved me several hours of research and coding. I was able to get a working C program, consisting of 170 lines of code and comments, in less than half an hour.
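To make the header issue concrete, here is a minimal sketch of a BOM check. It assumes the first two bytes of the file are already in memory; the names bom_kind, detect_bom and read_unit are hypothetical and exist only for this illustration. A little-endian file starts with the bytes FF FE, a big-endian file with FE FF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, used only for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first two bytes of a buffer for a UTF-16 byte order mark. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	if (len < 2)
		return BOM_NONE;
	if (buf[0] == 0xFF && buf[1] == 0xFE)
		return BOM_UTF16_LE;	/* U+FEFF stored low byte first */
	if (buf[0] == 0xFE && buf[1] == 0xFF)
		return BOM_UTF16_BE;	/* U+FEFF stored high byte first */
	return BOM_NONE;
}

/* Assemble one 16-bit code unit from two bytes in the detected order.
 * With no BOM this sketch assumes little-endian, the usual Windows layout. */
unsigned int read_unit(const unsigned char *p, enum bom_kind kind)
{
	if (kind == BOM_UTF16_BE)
		return ((unsigned int)p[0] << 8) | p[1];
	return ((unsigned int)p[1] << 8) | p[0];
}
```

A caller would run detect_bom() once on the first two bytes, skip them if a BOM was found, and then pass each following byte pair to read_unit().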

Then I discovered that the dos2unix suite of tools, which I have installed on almost every Linux machine I have set up, already handles this conversion. For the past three decades I have only used dos2unix to handle CR-LF conversion. As it turned out, all that was required was a little RTFM.

The third paragraph of the man page for the current version of dos2unix states:

Besides line breaks Dos2unix can also convert the encoding of files. A few DOS code pages can be converted to Unix Latin-1. And Windows Unicode (UTF-16) files can be converted to Unix Unicode (UTF-8) files.
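The man page does not say how the conversion is done. As a rough illustration of what turning UTF-16 into UTF-8 involves (this is not the dos2unix code, and the function names combine_surrogates and utf8_encode are hypothetical), a sketch might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into one code point.
 * Caller must pass hi in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
	return 0x10000u + ((uint32_t)(hi - 0xD800u) << 10)
	                + (uint32_t)(lo - 0xDC00u);
}

/* Encode one code point as UTF-8. Writes 1 to 4 bytes into out (which
 * must have room for 4) and returns the count, or 0 if cp is invalid. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
	assert(out != NULL);
	if (cp <= 0x7F) {
		out[0] = (unsigned char)cp;	/* plain 7-bit ASCII */
		return 1;
	}
	if (cp <= 0x7FF) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* lone surrogate: not a valid code point */
	if (cp <= 0xFFFF) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;	/* beyond the Unicode range */
}
```

The point of the sketch is that 7-bit ASCII passes through unchanged as a single byte, while everything else grows to two, three or four bytes, so nothing has to be replaced with a '?'.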

And as I read the man page, I realised it was larger than I thought it would be, with a whole bunch of additional options and improvements ... It seems that UTF-16 conversion may have been added to version 6 around 2012 or so ... Oh well, it was an interesting experiment and for what it's worth, below is the code, after I had cleaned it up, added options and debugged it ...


#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * The original script was written with Chat GPT 4.0
 *
 * Code was put into a subroutine, so that if any arguments are specified,
 * the process will attempt to open the arguments as files.
 *
 * Output is written to stdout
 *
 * Gerry Patterson (2025)
 */

#define MAX_BUF 8192
FILE *INP;
unsigned char buf[MAX_BUF];
int ptr = 0;
int use_stdin = 0;
int opt_p = 0;

/* ------------------------------------------------------------------------ */

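/* rtrim() - get rid of trailing white space */

char *rtrim(char *str)
{
	int i = strlen(str);
	for(i--; i >= 0; i--) {
		/* cast: isspace() is undefined for negative char values */
		if(isspace((unsigned char)str[i]))
			str[i] = '\x00';
		else
			break;
	}
	return (str);
}

/* ------------------------------------------------------------------------ */

/* flush_line() - print the buffered line and reset the buffer */

void flush_line (void)
{
	char *line = (char *)buf;

	if (opt_p) {
		/* keep trailing blanks, drop only the LF or CR-LF terminator */
		int n = strlen(line);
		if (n > 0 && line[n-1] == '\n')
			line[--n] = '\0';
		if (n > 0 && line[n-1] == '\r')
			line[--n] = '\0';
		printf("%s\n", line);
	} else {
		printf("%s\n", rtrim(line));
	}
	memset(buf, 0, sizeof(buf));
	ptr = 0;
}

/* ------------------------------------------------------------------------ */

void store_char ( unsigned char c)
{
	buf[ptr++] = c;
	/*
	 * Flush at end of line, or early if the line is about to overflow
	 * buf (the last byte is kept as the terminating NUL). An over-long
	 * line is therefore split across several output lines.
	 */
	if (c == 0x0A || ptr >= MAX_BUF - 1)
		flush_line();
}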

/* ------------------------------------------------------------------------ */

void usage()
{
	printf("\n");
	printf("USAGE:\n");
	printf("\tu2asc [OPTION] [FILE]\n\n");
	printf("OPTIONS:\n");
	printf("\t-h [Or any other char] Show this screen.\n");
	printf("\t-p Preserve trailing whitespace on input lines\n\n");
	printf("If no file[s] specified read stdin.\n\n");
	exit(1);
}

/* ------------------------------------------------------------------------ */

int u2asc (void)
{
	int byte1, byte2;	/* int, so that EOF can be told apart from data */
	uint16_t utf16_char;
	char ascii_char;
	int first_word = 1;

	while ((byte1 = fgetc(INP)) != EOF) {
		byte2 = fgetc(INP);
		if (byte2 == EOF) {
			// Handle unexpected EOF (odd number of bytes)
			fprintf(stderr, "Unexpected end of input.\n");
			return 1;
		}

		// Construct UTF-16LE character
		utf16_char = (uint16_t)((byte2 << 8) | byte1);

		// Skip the byte order mark (BOM) if it is the first word
		if (first_word) {
			first_word = 0;
			if (utf16_char == 0xFEFF)
				continue;
		}

		// Convert to ASCII
		if (utf16_char <= 0x7F)
			ascii_char = (char)utf16_char;
		else
			ascii_char = '?';  // Replace non-ASCII with '?'

		store_char(ascii_char);
	}

	// Flush a final line that has no terminating newline
	if (ptr > 0)
		store_char(0x0A);
	return 0;
}

/* ------------------------------------------------------------------------ */

/*
 * Process options.
 * See getopt.c for more details
 */

int process_opt(int argc, char **argv)
{
	int c;

	while ((c = getopt (argc, argv, "ph")) != -1) switch (c) {

		case 'p':
			opt_p++;
			break;

		case 'h':
			usage();	/* does not return */
			break;

		case '?':
			/* sanity check options */
			fprintf(stderr, "\nERROR Unrecognised option, Valid options are as below:\n");
			usage();	/* does not return */
			break;

		default:
			break;

	}
	if (optind == argc) {
		/* no file specified - use stdin */
		INP = stdin;
		return (1);
	}
	return(0);
}

/* ------------------------------------------------------------------------ */

int main (int argc, char **argv)
{
	int i;
	use_stdin = process_opt(argc, argv);
	if (use_stdin) {
		// Read data from stdin
		return(u2asc());
	}
	/* getopt() leaves optind at the first non-option argument */
	for (i = optind; i < argc; i++) {
		/* "rb" because the input is binary UTF-16, not text */
		if ( (INP = fopen( argv[i], "rb" ) ) == NULL ) {
			fprintf(stderr,"ERROR Cannot open file: %s\n",argv[i]);
			// Could replace this with continue?
			return(1);
		}
		if (u2asc()) {
			printf("\n");
			fprintf(stderr,"ERROR Processing file: %s\n",argv[i]);
			fclose(INP);
			return(1);
		}
		fclose(INP);
	}
	return(0);
}


The above simple one-trick pony doesn't match the versatility of dos2unix. Still, the improvement in the code produced by Chat GPT 4.0 has been impressive.

However, I think it would be premature for any software development firm to start dismissing most of their workforce, for the time being at least. Overall, the experiments I ran produced a usable result because I asked the right questions. IMHO creating initial specifications, testing, documenting, identifying and fixing bugs, as well as integration with legacy processes ... are all tasks that require a human agent. One who understands the basics of systems analysis, architecture and programming.

Also, it seems to me that Chat GPT tries a little too hard to prove it can pass the Turing test. And it wouldn't be the first. In fact Google has been quietly passing the test for the past two decades. Using the summary option in Google searches can produce a general purpose result that matches Chat GPT, and it includes a long list of related URLs for more context and further research (if required). Now that might be because I didn't pony up with the necessary wherewithal for a subscription to Chat GPT. But for the time being I think I'll keep my powder dry.




Copyright © 2025, Gerry Patterson. All Rights Reserved.