#!/usr/bin/env perl -w
#
# TFormatSupport.pm: Support specific file formats for TabularFormats.pm.
# Original, 2010-03-23 by Steven J. DeRose, as csvFormat.pm
#
package TFormatSupport;
use strict;
use feature 'unicode_strings';
use sjdUtils;
#sjdUtils::try_module("HTML::Entities") || warn
# "Can't access CPAN HTML::Entities module.\n";
sjdUtils::try_module("Datatypes") || warn
"Can't access sjd Datatypes module.\n";
#sjdUtils::try_module("FakeParser") || warn
# "Can't access sjd FakeParser module (needed for quasi-XML support).\n";
our %metadata = (
'title' => "TFormatSupport",
'description' => "",
'rightsHolder' => "Steven J. DeRose",
'creator' => "http://viaf.org/viaf/50334488",
'type' => "http://purl.org/dc/dcmitype/Software",
'language' => "Perl 5",
'created' => "2010-03-23",
'modified' => "2021-09-16",
'publisher' => "http://github.com/sderose",
'license' => "https://creativecommons.org/licenses/by-sa/3.0/"
);
our $VERSION_DATE = $metadata{'modified'};
=pod
=head1 Usage
use TFormatSupport;
Implements parsing and generation for basic record/field structures
across many syntactic forms, for use via C<TabularFormats.pm>.
The functionality and expressiveness are
essentially that of CSV and its kin; however, many formats
are supported for such simple data (even formats that can do more in general).
=head2 Formats and variations supported
The I<basicType> option values supported include these, which have a simple
records/fields structure:
B<ARFF> (for WEKA system),
B<COLUMNS> (column-oriented),
B<CSV> (lots of variations),
B<MIME> (headers),
B<XSV> (a simple XML subset designed for CSV-style data);
and also the following more sophisticated formats, which
can be used in simple ways that correspond to
a basic records/fields structure (examples below):
B<JSON> (a single top-level array or hash, of hashes),
B<Manchester> (a small subset of a syntax for RDF),
B<PERL> (a single top-level array or hash assignment),
B<SEXP> (LISP/Scheme S-expressions),
B<XML> (simple HTML table structures, using any tag names).
Formats that store data as binary fields (n-byte integers and floats,
packed bits, length-prefixed strings, and such) are not supported,
though there's no particular reason they can't be added.
C<TabularFormats> is a way to move simple tabular data around!
Representing general XML, JSON, programming-language structure declarations,
or OWL in CSV-like formats
is awkward at best, and this script doesn't deal with data that complex.
For details on the particular formats,
see below under L</"Supported formats, with examples">.
For details on the API, see C<TabularFormats.pm>, which is what you
actually call. It forwards to the appropriate format's support code in
TFormatSupport as needed.
B<Note>: If an input format can have more than one physical line per
logical record (such as I<some> CSVs, MIME, ARFF, XSV, etc.), then definitely
use I<readRecord>() instead of just Perl reads, to make sure you get a
whole logical record each time. For example:
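A minimal sketch of the idea, not the module's own code: the helper C<read_logical_record> below is hypothetical and only illustrates why a plain Perl read is not enough when a quoted field contains a newline. It assumes a single-character quote and ignores I<escape>; in real use, call I<readRecord>() through C<TabularFormats>.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TFormatSupport): read one logical CSV
# record, joining physical lines until the quote characters balance.
sub read_logical_record {
    my ($fh, $quote) = @_;
    $quote = '"' unless defined $quote;
    my $rec = '';
    while (defined(my $line = <$fh>)) {
        $rec .= $line;
        # An even number of quote characters means no quoted field is open.
        # Doubled quotes ("") add two, so they do not disturb the parity.
        my $n = () = $rec =~ /\Q$quote\E/g;
        last if $n % 2 == 0;
    }
    return length($rec) ? $rec : undef;
}

# Usage: an in-memory file whose second record spans two physical lines.
my $data = qq{a,b\n1,"two\nlines"\n3,4\n};
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
while (defined(my $rec = read_logical_record($fh))) {
    chomp $rec;
    print "record: [$rec]\n";
}
close($fh);
```

A line-at-a-time loop would have returned four physical lines here; the sketch returns three logical records.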
B<Note>: Some of the supported formats (such as JSON, SEXP, and XML)
can express much more complex structures than the others,
such as hierarchies rather than merely records and fields.
Only certain simple subsets of those formats are supported here:
the I<least common denominator>, if you will.
For example, XML support is limited to table-like structures (though you
can change tag names), with little but text inside individual cells; and
does not yet use a real parser (though it's pretty close).
JSON and SEXP are supported in similar fashion.
=head2 Additional formats to add:
mediaWiki table markup (do with CSV with I<fieldSep> C<||>?)
Perl 'unpack'
ARC (Internet Archive)
Python (tweak Perl support?)
RDF "Turtle"
SPARQL results
Google pbufs (unusual; unparseable without their schemas)
Excel(r)'s xlsx (XML) or even binary formats.
table = sheetData
tr = row @r="rownum"
td = c @r="A1etc" @t="s for string, n for numeric"
(link to sharedtext when string?)
inside c is v, containing a number: if cell is string, it's that
item from sharedStrings.xml (0-base, <r> element,
w/ <t> or <rPr> with markup).
move sharedStrings into the sheet; change tags to HTML.
=head1 Options
See C<TabularFormats.pm> for additional information.
All options are always available to I<setOption> and I<getOption>,
even those that are specific to formats not in use.
Using an unknown option name will fail.
You can set an option to 0 or "", but not to C<undef>.
Unless otherwise specified below, options with unquoted default values
are boolean, and options with quoted values are strings.
=head3 General Options
=over
=item * B<comment> (string, depending on basicType) --
Ignore records as comments if they begin with this string.
(Some formats permit comments other than at start of line;
they are not all fully supported here yet.)
The default value for I<comment> varies with the format:
"%" for ARFF,
None for COLUMNS,
"#" for CSV,
"//" for JSON,
"#" for Manchester,
None for MIME,
"#" for PERL,
";" for SEXP,
"<!--...-->" for XSV,
"<!--" for XML.
=item * B<prettyPrint> (boolean) 1
Output line-breaks and/or indentation to pretty-print output
(the details depend on the I<basicType> in effect).
=item * B<stripFields> (boolean) 1
Discard leading/trailing whitespace on fields.
=item * B<stripRecords> (boolean) 1
Discard leading/trailing whitespace on records.
=item * B<TFverbose> (integer) 0 # Show more trace information?
=back
=head3 Options for ARFF format
=over
(none)
=back
=head3 Options for COLUMNS format
=over
(none) However, you must use I<setFieldPosition>() to say where each field goes.
=back
=head3 Options for CSV format
=over
=item * B<recordSep> (string) "\n"
Record separator
=item * B<fieldSep> (string) "\t"
Field separator
=item * B<delim>
Synonym for I<fieldSep>
=item * B<escape> (string) ""
Character used to escape others, typically
backslash (C<\>, d92) or escape (C<\e>, d27).
On input a generous range of escapes is expanded (there is presently
no way to disable particular cases).
They are shown here, assuming I<escape> is set to backslash
(see unescapedValue() for the implementation):
B<\a> -- bell (U+0007)
B<\b> -- backspace (U+0008)
B<\e> -- escape (U+001b)
B<\f> -- form feed (U+000C)
B<\n> -- line feed (U+000A)
B<\r> -- return (U+000D)
B<\t> -- tab (U+0009)
B<\0> -- null (U+0000)
B<\\> -- backslash (U+005C)
B<\777> -- 3-digit octal character code points
B<\xFF> -- 2-digit hexadecimal character code points
B<\uFFFF> -- 4-digit hexadecimal character code points
B<\UFFFFFFFF> -- 8-digit hexadecimal character code points
B<\x{F...}> -- n-digit hexadecimal character code points
=item * B<escape2hex> (character) ""
Character used to mark hex-escapes,
such as "%" for URI-style escapes. This applies after I<escape>.
=item * B<header> (boolean) 0
Is record 1 a header with field names?
=item * B<nlInQuotes> (boolean) 0
Allow newline in quotes?
B<Note>: With this option, you must read records via
I<readRecord>(); using normal Perl reads will of course not get
the multiple physical records that make up one logical record.
If this option is I<not> set, the library still checks for unbalanced
quotes, and issues a warning if found.
=item * B<qdouble> (boolean) 0
Can embedded quotes be expressed by doubling them? Default: off.
If I<escape> is also set, it takes precedence over I<qdouble>.
=item * B<quote> (string) "\""
The character used to quote more complex field values (particularly those
containing the I<fieldSep> character, or newlines if I<nlInQuotes> is in effect).
=item * B<tableSep> (string) ""
Start a whole new table on seeing this.
(Unused; not yet fully supported.)
=back
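The escapes listed under I<escape> above can be sketched roughly as follows. This is an illustration only: C<unescape_value> and C<expand_escape> are hypothetical names, I<escape> is assumed to be backslash, and the module's own unescapedValue() may differ in details such as the handling of unknown escapes (left untouched here).

```perl
use strict;
use warnings;

# Single-character escapes and the characters they stand for.
my %SIMPLE = (
    'a' => "\x07", 'b' => "\x08", 'e' => "\x1B", 'f' => "\x0C",
    'n' => "\x0A", 'r' => "\x0D", 't' => "\x09", '\\' => "\\",
);

# What may follow the escape character; longer forms are tried first.
my $ESC = qr/
    x\{[0-9A-Fa-f]+\}
  | U[0-9A-Fa-f]{8}
  | u[0-9A-Fa-f]{4}
  | x[0-9A-Fa-f]{2}
  | [0-7]{3}
  | [0abefnrt\\]
/x;

# Turn the text after the backslash into the character it denotes.
sub expand_escape {
    my ($e) = @_;
    return chr(hex $1) if $e =~ /^[xuU]\{?([0-9A-Fa-f]+)\}?$/;
    return chr(oct $e) if $e =~ /^[0-7]{3}$/;
    return "\0"        if $e eq '0';
    return $SIMPLE{$e};
}

# Expand every recognized escape in a field value.
sub unescape_value {
    my ($s) = @_;
    $s =~ s/\\($ESC)/expand_escape($1)/ge;
    return $s;
}

print unescape_value('Name\tState\x21'), "\n";
```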
=head3 Options for JSON format
=over
=item * B<jsonArray> (boolean) 0
Should the JSON top level object be an array, or a hash?
=back
=head3 Options for Manchester OWL format
=over
=item * B<classField> (name) "Type"
Treat the specified field as the "Class" (this matters when converting other
data to Manchester, since the name "Class" is special in OWL).
=back
=head3 Options for MIME format
=over
(none)
=back
=head3 Options for PERL format
=over
(none)
=back
=head3 Options for SEXP format
=over
(none; may add choice of alist vs. list, or SXML support)
=back
=head3 Options for XML format
The "XML" format here supports XML structurally like (X)HTML tables,
except that you can change tag names. I<readRecord> reads up to the
next end-tag for the I<trTag> value (default: "tr"). Parsing is overly
simplistic, but better if you use I<readAndParseRecord>
instead of I<readRecord> followed by I<parseRecord>.
HTML is somewhat supported, but you're better off running the data through
C<tidy> or similar first.
=over
=item * B<htmlTag>, B<tableTag>, B<theadTag>, B<tbodyTag>,
B<trTag>, B<tdTag>, B<thTag>.
These options can be used to replace the default HTML element type names
for XML input and output. In general, tags set to "" will be ignored/omitted.
=item * B<attrFields> (string) ""
A whitespace-separated list of XML attribute names. These attributes
(regardless of which elements they occur on) will be treated as additional
fields, with the same names. An obviously useful one could be C<href>.
For input, if such attributes occur on multiple elements within a "record"
(in effect, within a table row), then the result is undefined.
Probably you'll get the last one.
For output, there is no way to put certain fields into attributes
rather than cell-like elements (yet).
This option may be extended to allow specifying an element name, child number,
or associated C<class> attribute value (for example, C<h1@class>).
=item * B<idAttr> (string) "id"
Use this name for ID attributes generated for output XML.
See I<idValue>, which must be set for this to have any effect.
=item * B<idValue> (string) ""
If not "", turns on generation of ID attributes for each output row in XML.
The attribute name is taken from I<idAttr>, while I<idValue> specifies
where to get the value:
=over
=item * if it is a field name, that field's value will be used
(and that field will not be written out as a regular field);
=item * if it is a token ending in "*", the "*" will be replaced by
the row number in the table (counting from 1), and the result will be used
as the ID.
=back
=item * B<classAttr> (name) "class"
Specifies the attribute of the "td" (field) elements
onto which the field-name will be put (typically, this is so HTML table
output gets C<td> elements distinguished by a separate C<class> value
for each column, rather than being undistinguished).
If this is set to "", use field-names as the I<element type names> for fields
(instead of just using C<td> or the I<tdTag> value).
=item * B<colspecs> (boolean) 0
Generate HTML table COL elements?
For this to be very useful, you'll probably want to use I<setFieldPosition>().
The column specifications can include width and alignment. You can also
use the I<classAttr> option to put field names in as class attributes on
cells, and use that to hook up style definitions.
=item * B<entityWidth> (int) 5
Minimum digits to write for numeric character references, as an integer.
=item * B<entityBase> (10|16) 16
Base for writing numeric character references, as an integer.
=item * B<HTMLEntities> (boolean) 0
Use HTML entity names when applicable? (not yet supported)
=item * B<publicId> (string) ""
Write out a DOCTYPE declaration, with this PUBLIC id, and
with the document type name taken from I<htmlTag>.
=item * B<systemId> (string) ""
Write out a DOCTYPE declaration, with this SYSTEM id, and
with the document type name taken from I<htmlTag>.
=item * B<XMLEntities> (boolean) 1
Use the 5 XML built-in entities (if turned off, then
use numeric character references even for escaping those 5 characters).
=item * B<XMLDecl> (boolean) 0
Write out an XML Declaration.
=back
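How I<XMLEntities>, I<entityBase>, and I<entityWidth> interact can be sketched as below. The helper C<escape_xml> is hypothetical, and two points are assumptions rather than documented behavior: that every non-ASCII character is written as a numeric character reference, and that hexadecimal digits are written in upper case.

```perl
use strict;
use warnings;

my %XML_BUILTIN = (
    '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
    '"' => '&quot;', "'" => '&apos;',
);

# Hypothetical helper: escape text for XML output, using option names
# taken from the list above with their documented defaults.
sub escape_xml {
    my ($text, %opt) = @_;
    my $base  = $opt{entityBase}  // 16;
    my $width = $opt{entityWidth} // 5;
    my $named = $opt{XMLEntities} // 1;
    # Numeric character reference, zero-padded to at least $width digits.
    my $ncr = sub {
        my ($cp) = @_;
        return $base == 16
            ? sprintf('&#x%0*X;', $width, $cp)
            : sprintf('&#%0*d;',  $width, $cp);
    };
    # The five XML-special characters: named entities unless turned off.
    $text =~ s/([&<>"'])/$named ? $XML_BUILTIN{$1} : $ncr->(ord $1)/ge;
    # Assumption: all non-ASCII characters become numeric references.
    $text =~ s/([^\x00-\x7F])/$ncr->(ord $1)/ge;
    return $text;
}

print escape_xml("v\x{E1}n Dam & Co."), "\n";
print escape_xml("a<b", XMLEntities => 0, entityBase => 10, entityWidth => 1), "\n";
```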
=head3 Options for XSV
=over
=item * B<typeCheck> (boolean) 0
Check conformity to any datatypes
that are declared in the XSV <Head> (see L<XmlTuples.pm> for details).
=back
=head1 Methods
Each format-support package inherits from TFormatSupport.
It may inherit methods from there, and it overrides at most these methods:
new()
isOkFieldName(fname) (usually just inherits)
readAndParseHeader()
readRecord() (usually just inherits)
parseRecordToArray(s) (usually just inherits)
parseRecordToHash(s)
assembleRecordFromArray()
assembleField()
assembleComment()
assembleRecordFromHash() (usually just inherits)
assembleHeader() (usually just inherits)
assembleTrailer() (usually just inherits)
These can all be reached via the C<TabularFormats> API, which generally
forwards them down to the format-support package currently in use,
except readAndParseHeader(), for which the C<TabularFormats> wrapper also
checks and fixes field names, and defines the named fields.
See the documentation there for how to use them.
=head1 Supported formats, with examples
Whitespace below (other than line-breaks)
has been inserted just for readability, and is not required
except where specially noted.
These examples are provisional, while I work out the details.
=head2 ARFF
This is the I<Attribute-Relation File Format> form for the C<WEKA> ML toolkit
(see "References", below).
A logical data record in ARFF is just a physical line.
No comments.
This format begins with a multi-line C<@RELATION> section (the header).
The fields are called "attributes", and declare a name and
a datatype chosen from:
=over
=item * C<NUMERIC> -- real or integer numbers.
C<INTEGER> and C<REAL> are synonyms for C<NUMERIC>.
=item * C<STRING> -- single- or double-quoted if the value contains
spaces and/or commas (and/or curly braces?) How to put a single and/or double
quote inside a field delimited by double and/or single quotes, is unclear.
This script will fail for fields that contain I<both> kinds of quotes.
=item * C<< DATE [<format>] >> -- the format is optional, and defaults to
ISO 8601 combined form: C<yyyy-MM-dd'T'HH:mm:ss>. The ARFF documentation
is unclear whether the square and/or pointy-brackets are literal.
Formats supported are those of Java's C<SimpleDateFormat> class (see
L<http://docs.oracle.com/javase/1.4.2/docs>)
=item * C<{ name1, name2,... }> -- a "nominal-specification" or enum.
Values can contain spaces (and commas?), but then must be quoted.
Values are defined to be case-sensitive (though that doesn't matter
merely for parsing).
=back
Data follows, after a line beginning C<@DATA>, and is essentially a CSV.
Field values must be (single- or double-) quoted if they contain spaces
(or presumably commas).
Spaces outside of quotes are discarded.
C<?> is reserved for "missing values" (probably a literal "?" can be
expressed by quoting?).
There is a "sparse" data format as well, where each sparse record is enclosed
in curly braces (presumably, that also means values must be quoted if they
contain curly braces). For example:
{ 1 John, 3 MA }
The values inside sparse ARFF records are pairs, consisting of a field
number (counting from 0!) and a value (fields not specified are 0, not "?").
This script accepts but never generates sparse format.
Weka itself has a bug dealing with field 0 in sparse format.
This script has a bug dealing with commas within quotes in ARFF.
Example:
% Signers data
%
@RELATION DeclarationOfIndependence
@ATTRIBUTE Fname STRING
@ATTRIBUTE LName STRING
@ATTRIBUTE State { PA, MA, RI, DE, NH, VT, VA }
@DATA
John, Adams, MA
Benjamin, Franklin, PA
John, Hancock, MA
Stephen, Hopkins, RI
Andries, 'van Dam', RI
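The sparse record form described above can be expanded into a full row along these lines. This is a sketch under stated assumptions: C<expand_sparse_arff> is a hypothetical helper, the caller already knows the number of attributes, and values contain no commas (quoted commas are not handled).

```perl
use strict;
use warnings;

# Hypothetical helper: expand one sparse ARFF record into a full row.
sub expand_sparse_arff {
    my ($record, $nFields) = @_;
    my @row = (0) x $nFields;            # unspecified fields default to 0
    (my $body = $record) =~ s/^\s*\{\s*|\s*\}\s*$//g;   # drop the braces
    for my $pair (split /\s*,\s*/, $body) {
        # Each pair is a field number (from 0), whitespace, then a value.
        my ($index, $value) = split ' ', $pair, 2;
        next unless defined $value;
        $value =~ s/^(['"])(.*)\1$/$2/;  # drop one layer of matching quotes
        $row[$index] = $value;
    }
    return @row;
}

my @row = expand_sparse_arff('{ 1 John, 3 MA }', 4);
print join('|', @row), "\n";    # prints 0|John|0|MA
```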
=head2 COLUMNS
Fixed column-oriented layout. To use this, you'll need to call
I<setFieldPosition>() to define column placements.
A logical data record in COLUMNS is just a physical line.
No comments allowed, although I<header> can be set.
Whitespace I<does> matter in COLUMNS format.
    Fname     LName     State
    John      Adams     MA
    Benjamin  Franklin  PA
    John      Hancock   MA
    Stephen   Hopkins   RI
    Andries   van Dam   RI
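Extracting fields by position can be sketched as follows. The C<@layout> table and C<parse_columns> are hypothetical stand-ins for whatever I<setFieldPosition>() records internally (its signature is documented in C<TabularFormats.pm>, not here); the column widths are assumed to be 10, 10, and 2.

```perl
use strict;
use warnings;

# Hypothetical layout: each entry is [name, start column (from 0), width].
my @layout = (
    [ Fname => 0,  10 ],
    [ LName => 10, 10 ],
    [ State => 20, 2  ],
);

# Cut one physical line into named fields by column position.
sub parse_columns {
    my ($line, $layout) = @_;
    my %rec;
    for my $f (@$layout) {
        my ($name, $start, $width) = @$f;
        # Short lines simply yield empty trailing fields.
        my $value = $start < length($line) ? substr($line, $start, $width) : '';
        $value =~ s/^\s+|\s+$//g;    # comparable to stripFields
        $rec{$name} = $value;
    }
    return \%rec;
}

my $rec = parse_columns('Andries   van Dam   RI', \@layout);
print "$rec->{Fname} / $rec->{LName} / $rec->{State}\n";   # Andries / van Dam / RI
```

Note that the embedded space in "van Dam" needs no quoting here, since only the column positions delimit fields.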
=head2 CSV
A wide variety of record/field delimited file formats, such as
CSV and TSV. "CSV" data varies in:
the choice of delimiter character,
repeatability of the delimiter (usually only for space),
whether spaces are ignored (especially at start of record),
whether fields can be quoted (or I<must> be),
whether and how quotes can appear within quotes,
whether newlines can appear within quotes,
whether the first record is a header giving field names,
and so on.
A logical record in CSV is a physical line unless I<nlInQuotes> is set,
in which case newlines can appear inside quotes as part of the data.
No comments are typically allowed, though this script does support them
if needed (see I<setOption>()).
For example, with I<fieldSep> set to "," and I<quote> to "\"":
Id, Fname, LName, State
Signer01, John, Adams, MA
Signer02, Benjamin, Franklin, PA
Signer03, John, Hancock, MA
Signer04, Stephen, Hopkins, RI
Signer05, Andries, "van Dam", RI
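How I<fieldSep>, I<quote>, and I<qdouble> combine when splitting one logical record can be sketched like this. C<parse_csv_record> is a hypothetical helper, not the module's parser; it assumes single-character separator and quote, ignores I<escape>, and strips whitespace from every field (including whitespace at the edges of quoted values).

```perl
use strict;
use warnings;

# Hypothetical sketch: split one logical CSV record into fields.
sub parse_csv_record {
    my ($rec, %opt) = @_;
    my $sep     = $opt{fieldSep} // "\t";
    my $quote   = $opt{quote}    // '"';
    my $qdouble = $opt{qdouble}  // 0;
    my @fields;
    my ($cur, $inQuote) = ('', 0);
    my @chars = split //, $rec;
    for (my $i = 0; $i < @chars; $i++) {
        my $ch = $chars[$i];
        if ($inQuote) {
            if ($ch ne $quote) {
                $cur .= $ch;          # separators are literal inside quotes
            } elsif ($qdouble && $i + 1 < @chars && $chars[$i + 1] eq $quote) {
                $cur .= $quote;       # a doubled quote stands for one quote
                $i++;
            } else {
                $inQuote = 0;         # closing quote
            }
        } elsif ($ch eq $quote) {
            $inQuote = 1;             # opening quote
        } elsif ($ch eq $sep) {
            push @fields, $cur;       # end of field
            $cur = '';
        } else {
            $cur .= $ch;
        }
    }
    push @fields, $cur;
    s/^\s+|\s+$//g for @fields;       # comparable to stripFields
    return @fields;
}

my @fields = parse_csv_record('Signer05, Andries, "van Dam", RI', fieldSep => ',');
print join('|', @fields), "\n";       # prints Signer05|Andries|van Dam|RI
```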
=head2 JSON
JavaScript Object Notation, commonly used for passing
program data structures around.
A logical data record in JSON is essentially
a JavaScript expression with balanced (), [], and/or {}.
Quoted content does not count toward balancing.
JSON does not formally allow comments, although some implementations do,
as does the potentially dangerous but common technique of simply "eval()'ing"
a JSON expression in JavaScript. This script will ignore
physical lines in JSON that match /^\s*\/\//.
    { "Table": {
        "Signer01": {"Fname":"John", "LName":"Adams", "State":"MA" },
        "Signer02": {"Fname":"Benjamin", "LName":"Franklin", "State":"PA" },
        "Signer03": {"Fname":"John", "LName":"Hancock", "State":"MA" },
        "Signer04": {"Fname":"Stephen", "LName":"Hopkins", "State":"RI" },
        "Signer05": {"Fname":"Andries", "LName":"van Dam", "State":"RI" }
    }}
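For comparison, data of this shape can be read with the core C<JSON::PP> module. This is a sketch, not what this script does internally (it has its own record reader): C<parse_json_table> is a hypothetical helper, and it assumes the top level is a hash whose "Table" member is a hash of hashes.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14; exports decode_json

# Hypothetical helper: strip whole-line // comments, then decode.
sub parse_json_table {
    my ($text) = @_;
    $text =~ s{^\s*//.*\n?}{}mg;    # physical lines matching /^\s*\/\//
    return decode_json($text)->{Table};
}

my $text = <<'EOJ';
// Signers data
{ "Table": {
    "Signer01": {"Fname":"John", "LName":"Adams", "State":"MA" },
    "Signer05": {"Fname":"Andries", "LName":"van Dam", "State":"RI" }
}}
EOJ

my $table = parse_json_table($text);
for my $id (sort keys %$table) {
    my $rec = $table->{$id};
    print join("\t", $id, @{$rec}{qw(Fname LName State)}), "\n";
}
```

Each inner hash corresponds to one record, with the outer key playing the role of the Id field.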
=head2 Manchester
The Manchester OWL (Web Ontology Language) format is
used by C<Protege> and some other C<RDF> applications
(see "References" below).
This script only supports the Manchester "IndividualFrame" item,
for assigning Class, SubClassOf, and Facts to the individuals.
As with XML, JSON, and some others, this represents a
"least common denominator" subset, comparable to CSV and its kin.
A logical data record in Manchester is here defined (for now) as an
item which extends from wherever the input stream is at, up to
just before the next I<frame> keyword: "Individual:", "Datatype:", etc.
Full-line comments are discarded, but not part-line comments.
# Some OWL data in Manchester format.
# For TabularFormats, the "Prefix" and "Class" items are discarded.
#
Prefix: : http://www.example.org/mystuff
Class: Signer
SubClassOf: owl:Thing
Individual: Signer01
Types: Signer
Facts: Fname John, LName Adams, State MA
Individual: Signer02
Types: Signer
Facts: Fname Benjamin, LName Franklin, State PA
Individual: Signer03
Types: Signer
Facts: Fname John, LName Hancock, State MA
Individual: Signer04
Types: Signer
Facts: Fname Stephen, LName Hopkins, State RI
Individual: Signer05
Types: Signer
Facts: Fname Andries, LName "van Dam", State RI
=head2 MIME
MIME mail header form (incomplete).
See L<RFC 1521>, L<RFC 2045>, L<RFC 822>.
Uses I<label:>-prefixed fields (with continuation lines indented), and
a blank line (only) before each entire record. For example:
    Name: Alexander
        the Great
    Nationality: Greek
A logical data record in MIME is a series of physical lines, up to a blank line.
Each "record" in this case is treated as comparable to a mail message,
with each field expressed by a header line (and possibly continuation lines,
which must be indented). No comments are allowed.
    Id: Signer01
    Fname: John
    LName: Adams
    State: MA

    Id: Signer02
    Fname: Benjamin
    LName: Franklin
    State: PA

    Id: Signer03
    Fname: John
    LName: Hancock
    State: MA

    Id: Signer04
    Fname: Stephen
    LName: Hopkins
    State: RI

    Id: Signer05
    Fname: Andries
    LName: van
        Dam
    State: RI
This script does not insert headers such as MIME-Version, Content-Type,
Content-Transfer-Encoding, etc. (though perhaps it should, optionally).
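The record and continuation-line rules above can be sketched as follows. C<parse_mime_record> is a hypothetical helper; it assumes records are separated by blank lines, joins continuation lines with a single space, and does no quoted-printable decoding.

```perl
use strict;
use warnings;

# Hypothetical sketch: parse one MIME-style logical record (the physical
# lines up to a blank line) into a hash of field name => value.
sub parse_mime_record {
    my ($text) = @_;
    my (%rec, $last);
    for my $line (split /\n/, $text) {
        if (defined $last && $line =~ /^\s+(\S.*)$/) {
            $rec{$last} .= " $1";          # indented line: continuation
        } elsif ($line =~ /^([^:\s]+):\s*(.*?)\s*$/) {
            $last = $1;                    # "label: value" header line
            $rec{$last} = $2;
        }
    }
    return \%rec;
}

my $input = "Id: Signer05\nFname: Andries\nLName: van\n    Dam\nState: RI\n"
          . "\n"
          . "Id: Signer01\nFname: John\nLName: Adams\nState: MA\n";

# A blank line ends each logical record.
for my $chunk (split /\n[ \t]*\n/, $input) {
    my $rec = parse_mime_record($chunk);
    print "$rec->{Id}: $rec->{Fname} $rec->{LName} ($rec->{State})\n";
}
```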
=head2 PERL
This is mainly for output. It produces PERL source code that creates
an array (one element per record) of references to hashes (which map from
field names to values).
All the field names, and all non-numeric values, are quoted as strings.
You should be able to just paste this into a PERL program and then access
the data easily.
A logical data record in PERL input is here considered to be everything up to
the next unquoted/non-comment semicolon
(slightly simpler than PERL reality, but fairly close).
Comments run from any unquoted "#" to end of line.
For output, each "record" begins a new physical line.
    # Some data.
    #
    my @foo = (
        { "Fname" => "John",
          "LName" => "Adams",
          "State" => "MA",
        },
        { "Fname" => "Benjamin",
          "LName" => "Franklin",
          "State" => "PA",
        },
        { "Fname" => "John",
          "LName" => "Hancock",
          "State" => "MA",
        },
        { "Fname" => "Stephen",
          "LName" => "Hopkins",
          "State" => "RI",
        },
        { "Fname" => "Andries",
          "LName" => "van Dam",
          "State" => "RI",
        },
    );
=head2 SEXP
S-Expressions should be familiar from LISP, Scheme, and their kin.
They are supported in two flavors: association lists vs. plain lists.
Field names and values are quoted: with a single leading apostrophe if they
consist only of alphanumerics, otherwise enclosed in double quotes.
A logical data record in SEXP is data up to and including the balancing ")".
Parentheses inside quotes do not count toward balancing.
The final ")" may occur at mid-line; later data on that line is part of
the next SEXP. Physical lines matching /^\s*;/ are ignored (comments).
See also C<sexp2xml>, which is similar but also handles a variety of
additional features used in Penn TreeBank files.
; My data as an S-expression
;
(
('Table
('Signer01 ('Fname . 'John) ('LName . 'Adams) ('State . 'MA))
('Signer02 ('Fname . 'Benjamin) ('LName . 'Franklin) ('State . 'PA))
('Signer03 ('Fname . 'John) ('LName . 'Hancock) ('State . 'MA))
('Signer04 ('Fname . 'Stephen) ('LName . 'Hopkins) ('State . 'RI))
('Signer05 ('Fname . 'Andries) ('LName . "van Dam") ('State . 'RI))
)
(
('Signer01 'John 'Adams 'MA)
('Signer02 'Benjamin 'Franklin 'PA)
('Signer03 'John 'Hancock 'MA)
('Signer04 'Stephen 'Hopkins 'RI)
('Signer05 'Andries "van Dam" 'RI)
)
)
I<SXML> syntax (see L<"References">)
is a variation on SEXP that is not yet supported.
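The balancing rule for SEXP logical records given above can be sketched as a small scanner. C<sexp_end> is a hypothetical helper; it assumes double-quoted strings without escapes and expects comment lines to have been removed already.

```perl
use strict;
use warnings;

# Hypothetical sketch: return the offset just past the ")" that balances
# the first "(" in $text, or -1 if the expression is incomplete.
sub sexp_end {
    my ($text) = @_;
    my ($depth, $inQuote) = (0, 0);
    for my $i (0 .. length($text) - 1) {
        my $ch = substr($text, $i, 1);
        if ($inQuote) {
            $inQuote = 0 if $ch eq '"';    # parentheses in quotes are ignored
        } elsif ($ch eq '"') {
            $inQuote = 1;
        } elsif ($ch eq '(') {
            $depth++;
        } elsif ($ch eq ')') {
            return $i + 1 if --$depth == 0;
        }
    }
    return -1;
}

my $buf = q{('Signer05 'Andries "van (Dam)" 'RI) ('Signer01};
my $end = sexp_end($buf);
print substr($buf, 0, $end), "\n";     # the first complete record
print "rest: ", substr($buf, $end), "\n";
```

Anything after the returned offset belongs to the next record, which matches the rule that the final ")" may occur at mid-line.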
=head2 XML
(X)HTML or XML table or table-like markup.
Each record and each field is an element, with field values as element content.
HTML table tags are used by default, but tag names can be changed.
Attributes are not presently used for fields
(except see I<idAttr>, I<idValue>, and I<classAttr>), though this
is a desirable addition.
A logical data record in XML is here defined as an
entire <tr> element (or another name, if I<trTag> has been
changed from the default). I<readRecord> returns other things in SAX-like units
(tags, comments, text, etc.)
<?xml version="1.0" encoding="utf-8"?>
<html>
...
<table>
<tr Class="MySigner" id="Signer01">
<td class="Fname">John</td>
<td class="LName">Adams</td>
<td class="State">MA</td>
</tr>
<tr Class="MySigner" id="Signer02">
<td class="Fname">Benjamin</td>
<td class="LName">Franklin</td>
<td class="State">PA</td>
</tr>
<tr Class="MySigner" id="Signer03">
<td class="Fname">John</td>
<td class="LName">Hancock</td>
<td class="State">MA</td>
</tr>
<tr Class="MySigner" id="Signer04">
<td class="Fname">Stephen</td>
<td class="LName">Hopkins</td>
<td class="State">RI</td>
</tr>
<tr Class="MySigner" id="Signer05">
<td class="Fname">Andries</td>
<td class="LName">van Dam</td>
<td class="State">RI</td>
</tr>
</table>
...
I plan to add options to allow you to switch to other popular schemas
in one step (rather than setting the tag names individually), and
to provide for treating chosen XML attributes also as fields.
I don't anticipate supporting column or row spans here at all
(though see my C<xml2tab> for some support).
HTML Docbook NLM TEI CALS
----------------------------------------
TABLE table table table table
COLGROUP colgroup
COL colspec @rend colspec
THEAD thead thead head
TBODY tbody tbody
TFOOT tfoot tfoot
TR row tr row row
TD entry td cell entry
TH entry th cell entry
CAPTION caption
@CLASS @role
=head2 XSV
XSV is a very simple subset
of XML, limited to about the same functionality as CSV, ARFF, etc.
It does, however, support datatype checking,
as well as default values, C<HTML>-like "BASE" factoring,
and a few other options that can save a great deal of space.
It is supported via its own package, C<XmlTuples.pm>.
Every XSV data set is a Well-Formed XML document (so can be processed
by perfectly normal XML software). However, not all WF XML documents
are XSV. In other words, XSV supports a subset of XML.
An XSV document has a single <Head> element, or a sequence of <Head> elements
contained in a single <Xsv> element (which may have Dublin Core attributes).
A <Head> in turn contains any number of (empty) <Rec> (record) elements,
with an attribute for each field.
XML comments are also allowed.
Each tag and comment must begin on a new physical line
(this is permitted but by no means required in XML).
The <Head> element lists the attributes (fields) that are permissible for
the <Rec> elements it contains. The value of each attribute on <Head>
(possibly just ""),
is the default for the like-named attribute on contained <Rec> elements,
except that if the value begins with "#" it specifies a datatype,
one of the XSD built-in ones, or a few others such as enums.
See C<XmlTuples.pm> for details.
For example:
<!-- Some signers of the Declaration of Independence.
List created: March 15, 1066 A.D.
-->
<Xsv creator="george@wasington.gov" title="The Framers">
<Head Id="#NMTOKEN" Fname="John" LName="#string"
State="" DOB="#date">
<Rec Id="Signer01" Fname="John" LName="Adams" State="MA" />
<Rec Id="Signer02" Fname="Benjamin" LName="Franklin" State="PA" />
<Rec Id="Signer03" LName="Hancock" State="MA" />
<Rec Id="Signer04" Fname="Stephen" LName="Hopkins" State="RI" />
<Rec Id="Signer05" Fname="Andries"
LName="ván Dam" State="RI" />
</Head>
</Xsv>
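The defaulting rule for <Head> attributes can be sketched as below. This is an illustration only, not how C<XmlTuples.pm> is implemented: C<parse_attrs> and C<apply_head_defaults> are hypothetical, attribute values are assumed to be double-quoted with no entity references, and a field whose <Head> value is a "#" datatype is simply left absent when the <Rec> omits it.

```perl
use strict;
use warnings;

# Hypothetical sketch: collect the attributes of one start-tag into a hash.
sub parse_attrs {
    my ($tag) = @_;
    my %attrs;
    while ($tag =~ /([\w:.-]+)\s*=\s*"([^"]*)"/g) {
        $attrs{$1} = $2;
    }
    return %attrs;
}

# Apply <Head> defaults to a <Rec>. Head values starting with "#" declare
# a datatype rather than a default, so they contribute nothing here.
sub apply_head_defaults {
    my ($headTag, $recTag) = @_;
    my %head = parse_attrs($headTag);
    my %rec  = parse_attrs($recTag);
    for my $name (keys %head) {
        next if exists $rec{$name};        # explicit value wins
        next if $head{$name} =~ /^#/;      # datatype, not a default
        $rec{$name} = $head{$name};
    }
    return \%rec;
}

my $head = '<Head Id="#NMTOKEN" Fname="John" State="">';
my $rec  = apply_head_defaults($head, '<Rec Id="Signer03" LName="Hancock" State="MA" />');
print join(', ', map { "$_=$rec->{$_}" } sort keys %$rec), "\n";
```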
=head1 Managing options
=over
=item * B<addOptionsToGetoptLongArg>I<(hashRef, prefix?)>
Add all of this package's options to the hash at I<hashRef>, in the form
you would pass to Perl's C<Getopt::Long> package. The options will be set
up to store their values directly to the TabularFormats instance, via
the I<setOption>() method.
If I<prefix> is defined,
it will be added to the beginning of each option name; this allows you
to avoid name conflicts with the caller, or between multiple instances of
TabularFormats (for example, one for input and one for output).
If an option is already present in the hash (note that the key, as always
for C<Getopt::Long>, includes aliases and suffixes like "=s"), a warning is
issued and the new one replaces the old.
=back
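The idea behind I<addOptionsToGetoptLongArg> can be sketched with C<Getopt::Long> alone. Everything named here is hypothetical: C<%options> stands in for the option store of a TabularFormats instance, C<add_options_to_getopt_arg> re-creates the behavior described above for just three options, and the prefix "in" is an arbitrary choice.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical stand-in for the option store of a TabularFormats instance.
my %options = (fieldSep => "\t", header => 0, quote => '"');

# Hypothetical re-creation: add prefixed option specs whose handlers
# store their values straight into %options.
sub add_options_to_getopt_arg {
    my ($getoptHash, $prefix) = @_;
    $prefix = '' unless defined $prefix;
    my %specs = (fieldSep => '=s', header => '!', quote => '=s');
    for my $name (sort keys %specs) {
        my $key = $prefix . $name . $specs{$name};
        warn "Replacing existing option '$key'\n" if exists $getoptHash->{$key};
        # Getopt::Long calls the handler with (option name, value).
        $getoptHash->{$key} = sub { $options{$name} = $_[1] };
    }
    return $getoptHash;
}

# The caller's own options share the hash with the prefixed ones.
my %getoptArgs = ('verbose!' => \my $verbose);
add_options_to_getopt_arg(\%getoptArgs, 'in');

my @argv = ('--infieldSep', ',', '--inheader', '--verbose');
GetOptionsFromArray(\@argv, %getoptArgs) or die "Bad options\n";
print "fieldSep=[$options{fieldSep}] header=$options{header}\n";
```

A second call with a different prefix (say "out") and a second option store would give separate input and output settings without name clashes.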
=head1 Known bugs and limitations
=over
=item * Not safe against UTF-8 encoding errors. Use C<iconv> if needed.
=item * Leading spaces on records are not reliably stripped.
=item * The I<ASCII> option is supported for JSON, Perl, XML, and XSV.
For some other formats it is not clear how to escape non-ASCII characters.
ARFF appears to have no such method at all.
MIME headers use I<quoted-printable> form, but support for full Unicode is not
yet finished.
=item * Support for decoding HTML entity references is implemented but
commented out; to use it, uncomment things starting C<HTML::Entities>
and install the eponymous CPAN package.
=item * Datatype checking is experimental.
=item * The behavior if using regexes rather than strings for I<fieldSep>,
I<quote>, I<comment>, etc. for CSVs is undefined.
Most likely it will work ok for input, but not for output.
=item * The behavior if a given field is found more than once in an input
record is undefined. This is only possible with some formats (essentially
those that identify fields by name, not position). Some options
may be added for this, perhaps taking the first or last, or concatenating
them with some separator, or serializing them somehow.
=item * B<ARFF> (WEKA) support has a few bugs: It may fail to parse records
that contain commas. Quoted fields that contain both single- and double-quotes
will likely fail for both input and output.
=item * B<COLUMNS>: widths can end up undefined if the caller uses
I<setFieldPosition>() without giving a width, and defines columns in an order other
than right-to-left.
=item * B<CSV>: using both C<quote> and C<escape> may get unhappy.
In particular, the behavior of \"" and \\" is undefined.
Perhaps options I<qdouble> and/or I<nlInQuotes> should default on.
=item * B<JSON>: No exponential notation. No control for whether
output records are enclosed as an array or a dictionary.
=item * B<Manchester>: barely implemented (but see I<xml2tab>).
=item * B<MIME>: experimental, and has no support for internal
structure within MIME lines. Perhaps it should use Multipart?
=item * B<SEXP>: sub-types/options are not yet implemented.
=item * B<XML>: Plain HTML is not yet supported; only XHTML
(use C<tidy> if needed).
ID-related options can generate an ID attribute from a specified
field. However, that field will still be generated as a table column, too.
Support for getting/putting fields from/to attributes is experimental
(see the I<attrFields> option).
=back
=head1 References
L<http://weka.wikispaces.com/ARFF> "ARFF".
L<http://tools.ietf.org/html/rfc4180> --
"Common Format and MIME Type for Comma-Separated Values (CSV) Files".
L<http://www.json.org/> --
"Introducing JSON".
L<http://www.w3.org/TR/owl2-manchester-syntax/> --
Matthew Horridge and Peter F. Patel-Schneider.
"OWL 2 Web Ontology Language: Manchester Syntax."
W3C Working Group Note, 27 October 2009.
L<http://www.oasis-open.org/specs/tm9901.html> --
(OASIS) "XML Exchange Table Model Document Type Definition".
L<http://cran.r-project.org/doc/manuals/R-data.pdf> --
"R Data Import/Export".
L<http://en.wikipedia.org/wiki/SXML> --
(Wikipedia article on) "SXML".
L<http://search.cpan.org/~msergeant/XML-Parser-2.36/Parser.pm> --
(CPAN documentation for XML::Parser).
=head1 History
# Original, 2010-03-23 by Steven J. DeRose, as csvFormat.pm
# (many changes/improvements).
# 2012-03-30 sjd: Rename to TabularFormats.pm, major reorg.
# 2013-02-11 sjd: Split format-support packages and doc out to here.
# 2013-04-02f sjd: Lose setFieldNamesFromArray() calls, let TabularFormats.pm