Help with Unicode files


John Shotsky
 

I'm trying to merge some html files into a single file. I have code to select
the files to merge, but for now just assume all html files in a folder. The
problem is that some of these html files have actual Unicode characters, not the
encoded versions. For example a single character 1/3, or smart quotes. I have
tried several approaches, but it seems something is converting them to question
marks in the append process. I'm using append to Unicode file to try to just
merge them into a single file and then open it in Unicode mode in NoteTab. But
by the time I can see it, that question mark is already there. None of the
individual html files have been opened/saved by NoteTab - it is supposed to be
appending Unicode html files into a big Unicode html file, but as I say,
somewhere it goes wrong. If I open the Unicode files with a Unicode editor, the
correct character is present, so I know it is something that is being done by
NoteTab. Maybe if I shell out to DOS for the merge and use xcopy/robocopy, etc.?

I know NTP is not very good at Unicode, but when using Unicode commands, it
should at least work.

Regards,

John


Axel Berger
 

John Shotsky wrote:
> I'm trying to merge some html files into a single file.

What exactly do you mean by merge? Are these complete HTML files and you
need to get rid of all the extra headers and </BODY></HTML>s?

The way I'd go is use a DOS command:

copy /b a.htm+b.htm+c.htm > all.htm

I'd then open it in non-UTF mode and run my UTF converter yielding &#nnn;

Lastly and if necessary I'd get rid of all the extra internal header data.
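
The converter step mentioned above (UTF-8 text to &#nnn; references) could look roughly like this in Python. This is only a sketch, not the actual converter used here; the function name to_numeric_entities is made up for illustration, and it assumes the merged file really is valid UTF-8.

```python
def to_numeric_entities(raw: bytes) -> str:
    """Decode UTF-8 bytes and replace each non-ASCII character with &#nnn;."""
    text = raw.decode("utf-8")  # assumption: the input is valid UTF-8
    # The standard "xmlcharrefreplace" error handler writes every character
    # that does not fit into ASCII as a decimal reference such as &#8531;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# U+2153 is the single-character "one third"; U+201C/U+201D are smart quotes.
sample = "\u2153 cup \u201csugar\u201d".encode("utf-8")
print(to_numeric_entities(sample))  # &#8531; cup &#8220;sugar&#8221;
```

Plain ASCII, including all the HTML tags, passes through unchanged, so the result can be opened and edited safely in non-Unicode mode.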


--
/¯\ No | Dipl.-Ing. F. Axel Berger Tel: +49/ 221/ 7771 8067
\ / HTML | Roald-Amundsen-Straße 2a Fax: +49/ 221/ 7771 8069
 X in | D-50829 Köln-Ossendorf http://berger-odenthal.de
/ \ Mail | -- No unannounced, large, binary attachments, please! --


John Shotsky
 

By merging, I simply mean getting over 200 html files into one html file. I can
do the rest, but the Unicode gets trashed in the copy.
Don't understand opening it in non-unicode mode - that simply makes each Unicode
character a question mark.
I tried your example, but that makes the command console flash for every file
copied. Over 200 flashes won't work.
Regards,
John



Axel Berger
 

John Shotsky wrote:
> Don't understand opening it in non-unicode mode - that simply makes each Unicode
> character a question mark.

No! It shows the file as it is, i.e. two (or more) strange upper half ASCII
characters, not one decoded one.
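
The difference between the two views can be illustrated with a few lines of Python. This is a sketch under assumptions: the 1/3 character is taken to be U+2153, and Windows-1252 (cp1252) is assumed as the non-UTF code page, which may not match every setup.

```python
third = "\u2153"  # assumption: the single-character "one third"

raw = third.encode("utf-8")      # the bytes actually stored in a UTF-8 file
raw_view = raw.decode("cp1252")  # what a non-UTF (ANSI) view of them shows
# Forcing the decoded character into the ANSI code page loses it entirely:
lossy = third.encode("cp1252", errors="replace")

print(raw.hex(" "))   # e2 85 93
print(len(raw_view))  # 3
print(lossy)          # b'?'
```

Viewed raw, the three bytes show up as three odd-looking characters but are still intact. The question mark only appears once something has decoded the character and then tried to squeeze it into a code page that does not contain it.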

I always start NT with the command line parameter "/RawUTF8" (Took quite
some effort to set up, but worth it). Before I did that, the way to go and
prevent NT from doing its misguided UTF shenanigans was to open a new,
empty document in NT, open the file in another editor, copy and paste into
NT.

I open and convert UTF files all the time so I know it works and how.

> I tried your example, but that makes the command console flash for every file
> copied. Over 200 flashes won't work.

I don't understand. 200 files will probably exceed the maximum command line
length but apart from that it works. Who or what flashes? It is one single
command and it just runs once -- at least if I typed the syntax correctly
from memory. There are slightly different flavours, look it up. Mine is for
4DOS 7.50 or DOS 7.10 (i.e. Win98).

Correction: I just checked and the ">" is wrong.

copy /b a.htm+b.htm+c.htm all.htm

should do it.
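
For a long or user-selected list of files, the same byte-for-byte concatenation can be done in a single process, with no command-line length limit and no console window per file. The following Python sketch assumes a scripting language is available next to NoteTab; merge_binary and the file names are hypothetical.

```python
from pathlib import Path


def merge_binary(sources, target):
    """Append the raw bytes of each source file to target, in the given order."""
    with open(target, "wb") as out:
        for src in sources:
            # Binary mode: nothing is decoded or re-encoded, so multi-byte
            # UTF-8 sequences arrive in the output exactly as they were.
            out.write(Path(src).read_bytes())


# Possible usage (hypothetical folder and names):
#   selected = sorted(Path("pages").glob("*.htm"))
#   merge_binary(selected, "all.htm")
```

Like copy /b, this does nothing about repeated headers, and a byte order mark at the start of an input file would end up in the middle of the output.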




John Shotsky
 

That makes more sense. But it is also verbose and will write into the doc,
unless you add >nul at the end of the command. The reason it's flashing the
command console is because I need it to run for each selected file. By default
they are all selected, but users can choose to only run one file, if they want.
So I have a user selected set of files to run each time, and I have it process
through that list in a loop.
I decided to do it a different way entirely. All my users have Calibre already,
so when I get a list of the files they want processed, I will create an .opf
file with the list as links, and let Calibre make an ebook out of it. From
there, I have everything else I need.
Regards,
John



John Wallace
 

On 2020-04-25 19:23, John Shotsky wrote:
> That makes more sense. But it is also verbose and will write into the doc,
> unless you add >nul at the end of the command. The reason it's flashing the
> command console is because I need it to run for each selected file. By default
> they are all selected, but users can choose to only run one file, if they want.
> So I have a user selected set of files to run each time, and I have it process
> through that list in a loop.
> I decided to do it a different way entirely. All my users have Calibre already,
> so when I get a list of the files they want processed, I will create an .opf
> file with the list as links, and let Calibre make an ebook out of it. From
> there, I have everything else I need.
> Regards,
> John

You just need the stuff between the body and /body tags?
(not the stuff in the head /head stuff)


--
John Wallace
Pontiac Power RULES !!!
www.wallaceracing.com


Art Kocsis
 

On 04-25-2020 09:56, John Shotsky wrote:
> I know NTP is not very good at Unicode, but when using Unicode commands, it
> should at least work.

Now John, you've been around computers long enough, how could you not see the humor in that!

How many times have you even mumbled to yourself, "It should work!"? Hopefully, your hair grows back quickly.

Thanks for the chuckle.

Art