For some reason, each alphanumeric character in the mainframe file is preceded by an extra 'C' or 'B'. Special characters (e.g. '@') are single bytes without that prefix.
The following script handles the characters well. It does not handle numeric fields (integers, floating point, signed, unsigned, etc.) because without a copybook it is impossible to convert them correctly. I leave that as the next challenge.
import codecs
import sys

# Usage: python3 converter.py FILENAME
filename = sys.argv[1]

# The EBCDIC in these files is not standard.
# General method:
#   Read one byte and decode it as cp500.
#   'C' means the next byte is a capital letter or a digit.
#   'B' means the next byte is a lowercase letter.
#   After 'C', the next byte is a digit if it is in 0xb0..0xb9
#   (0xb0 is '0'), for whatever reason.

# Lookup dict for special characters
# {EBCDIC_code: ascii_char}
hexLookup = {
    0x40: ' ',
    0x4b: '.',
    0x6b: ',',
    0x7c: '@',
    0x61: '/',
    0x80: '{',
    0x90: '}',
}

# Number of characters on each line
LINEWIDTH = 999

with open(filename, 'rb') as infile, open('converted.txt', 'w', encoding='utf-8') as output:
    lineLength = LINEWIDTH
    while True:
        lineLength -= 1
        ebcdicChar = infile.read(1)  # Read 1 byte at a time
        if not ebcdicChar:  # Finish when no more bytes
            break
        unicodeChar = codecs.decode(ebcdicChar, 'cp500')
        actualChar = unicodeChar
        if unicodeChar == 'C':
            nextByte = infile.read(1)
            if not nextByte:  # File ended right after a prefix byte
                break
            # If meant to be a digit
            if 0xb0 <= nextByte[0] <= 0xb9:
                actualChar = str(nextByte[0] - 0xb0)
            # For special characters
            #elif nextByte[0] in hexLookup:
            #    actualChar = hexLookup[nextByte[0]]
            else:
                actualChar = codecs.decode(nextByte, 'cp500').upper()
        elif unicodeChar == 'B':
            actualChar = codecs.decode(infile.read(1), 'cp500')
        elif ebcdicChar[0] in hexLookup:
            actualChar = hexLookup[ebcdicChar[0]]
        # The output file is opened in text mode, so write the str directly
        output.write(actualChar)
        if lineLength == 0:
            output.write('\n')
            lineLength = LINEWIDTH
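
To illustrate why numeric fields need a copybook, here is a minimal sketch that decodes one packed-decimal (COMP-3) field. It assumes the copybook has already told us that the field is packed decimal and how many implied decimal places it has. The function name `unpack_comp3` and the `scale` parameter are made up for this example, and the sketch expects the raw mainframe bytes, not the prefixed ones in the file described above.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two digits per byte, sign in the last nibble.

    `scale` is the number of implied decimal places, which only the
    copybook can tell you (assumption: you have that information).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high nibble
        digits.append(byte & 0x0F)    # low nibble
    digits.append(raw[-1] >> 4)       # last byte: one digit ...
    sign_nibble = raw[-1] & 0x0F      # ... and the sign (0xD means negative)

    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


# The same three bytes mean different things depending on the declared scale.
print(unpack_comp3(b"\x12\x34\x5C", scale=2))
print(unpack_comp3(b"\x12\x34\x5C", scale=0))
```

Nothing in the bytes themselves says that they are packed decimal rather than text, or where the decimal point goes, which is why the script above leaves such fields alone.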
A few days later, I realised why I got the strange file: the person who created it had used copy and paste (Ctrl+C, Ctrl+V) in Notepad++. This basically re-encodes the binary data as text (UTF-8 by the looks of it, given that the 'B' and 'C' prefixes are the bytes 0xC2 and 0xC3). By the way, Notepad++ does support binary copy and paste. What a waste of time!
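
If that is what happened, the damage may be reversible without any per-character logic. The prefixes 'B' and 'C' are the bytes 0xC2 and 0xC3 in cp500, and those are exactly the lead bytes UTF-8 produces when each original byte from 0x80 to 0xFF is treated as a Latin-1 character. The sketch below rests on that assumption (Latin-1 interpretation, UTF-8 output), which I have not verified against Notepad++ itself; `mangle` and `unmangle` are hypothetical helper names.

```python
def mangle(ebcdic: bytes) -> bytes:
    """Simulate the suspected damage: each byte read as Latin-1, saved as UTF-8."""
    return ebcdic.decode('latin-1').encode('utf-8')


def unmangle(damaged: bytes) -> bytes:
    """Reverse it and recover the original EBCDIC bytes."""
    return damaged.decode('utf-8').encode('latin-1')


original = 'Ab1@'.encode('cp500')   # what the mainframe file should contain
damaged = mangle(original)          # what the copy and paste would leave behind
restored = unmangle(damaged)

print(damaged)                      # prefix bytes appear before 'A', 'b' and '1'
print(restored.decode('cp500'))     # text fields decode normally again
```

Because `unmangle` returns the raw bytes rather than decoded text, numeric fields would also come back intact and could then be converted with a copybook. If the file contains bytes that are not valid UTF-8, `decode('utf-8')` raises `UnicodeDecodeError`, which would be a sign that the assumption does not hold for that file.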