Monday, 7 August 2023

COBOL DataSet EBCDIC to ASCII Converter

For some reason, the alphanumeric characters in the mainframe file are each prefixed with either a 'C' or a 'B' byte. Special characters (e.g. '@') are plain single bytes without the extra prefix.
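
To see the pattern for yourself, a quick hex dump of the first few bytes is enough. In EBCDIC (cp500), 'C' is 0xC3 and 'B' is 0xC2, so those values should show up in front of most letters:

import sys

# Dump the first 32 bytes as hex to confirm the 'C'/'B' prefix pattern.
# Takes the mainframe file as its argument, same as the converter below.
with open(sys.argv[1], 'rb') as f:
    print(' '.join(f'{b:02x}' for b in f.read(32)))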

The following script handles the characters well. It does not handle numbers (integers, floating point, signed, unsigned, etc.) because, without a copybook, they cannot be converted correctly. I leave that as the next challenge; a rough sketch of what packed-decimal handling could look like appears after the script.

import codecs
import sys

# Usage: python3 converter.py FILENAME
filename = sys.argv[1]

# The EBCDIC format in these files is not standard
# General method:
# Read two bytes and look at the first one
# 'C' means the second byte is a capital letter or a digit
# 'B' means the second byte is a lowercase letter
# The second byte is a digit if it is >= 0xb0 (0xb0 is '0'), for whatever reason

# Lookup dict for special characters
# {EBCDIC_code : ascii_char}
hexLookup = {
    0x40 : ' ',
    0x4b : '.',
    0x6b : ',',
    0x7c : '@',
    0x61 : '/',
    0x80 : '{',
    0x90 : '}',
}
# Number of characters on each line
LINEWIDTH = 999

with open(filename, 'rb') as infile, open('converted.txt', 'w', encoding='utf-8') as outfile:
    lineLength = LINEWIDTH
    while True:
        ebcdicChar = infile.read(1) # Read 1 byte at a time
        if not ebcdicChar: # Finish when there are no more bytes
            break
        lineLength -= 1

        unicodeChar = codecs.decode(ebcdicChar, 'cp500')
        actualChar = unicodeChar
        if unicodeChar == 'C':
            # 'C' marks a capital letter or a digit; the real value is the next byte
            actualChar = infile.read(1)
            if 0xb0 <= actualChar[0] <= 0xb9:
                # Digits are stored as 0xb0 ('0') to 0xb9 ('9')
                actualChar = str(actualChar[0] - 0xb0)
            else:
                actualChar = codecs.decode(actualChar, 'cp500').upper()
        elif unicodeChar == 'B':
            # 'B' marks a lowercase letter; decode the next byte
            actualChar = codecs.decode(infile.read(1), 'cp500')
        elif ebcdicChar[0] in hexLookup:
            # Unprefixed special characters come from the lookup table
            actualChar = hexLookup[ebcdicChar[0]]

        # actualChar is always a str here; the output file handles UTF-8 encoding
        outfile.write(actualChar)

        if lineLength == 0:
            outfile.write('\n')
            lineLength = LINEWIDTH
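
For the record, here is roughly what the number handling might look like once a copybook is available. COBOL commonly stores numbers as COMP-3 (packed decimal): two digits per byte, with the sign in the final nibble. The field length and decimal scale come from the copybook and cannot be inferred from the raw bytes, so the function below is only a hypothetical sketch with made-up parameters:

# Hypothetical sketch: unpack a COMP-3 (packed decimal) field.
# The field bytes and the decimal scale must come from the copybook.
def unpack_comp3(raw, scale=0):
    digits = ''
    for byte in raw[:-1]:
        digits += f'{byte >> 4}{byte & 0x0F}'
    last = raw[-1]
    digits += str(last >> 4)
    # Sign nibble: 0xD means negative, 0xC and 0xF mean positive/unsigned
    sign = -1 if (last & 0x0F) == 0x0D else 1
    value = sign * int(digits)
    return value / (10 ** scale) if scale else value

# Example: b'\x12\x34\x5C' holds +12345; with scale=2 that is 123.45
print(unpack_comp3(b'\x12\x34\x5C', scale=2))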

A few days later, I realised why I got the strange file - the person who created it had copied and pasted it in Notepad++ (Ctrl+C, Ctrl+V), which by the looks of it turns the binary data into UTF-16. Notepad++ does, by the way, support binary copy and paste. What a waste of time!
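
If the mangling really was a straight widening of each byte into one UTF-16 code unit, something along these lines might recover the original binary before feeding it to the converter. This is a guess about how the paste behaved, not a tested fix, and it falls apart if any code point ended up above 0xFF (the filenames are placeholders):

# Speculative recovery: undo a byte-to-UTF-16 widening.
with open('pasted.txt', 'rb') as f:
    text = f.read().decode('utf-16')

# latin-1 maps code points 0x00-0xFF back to single bytes one-to-one,
# so this restores the original bytes if nothing was lost in the paste.
with open('restored.bin', 'wb') as f:
    f.write(text.encode('latin-1'))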