66816 – GAWK Support

Status:	CONFIRMED

Alias:	None

Product:	Calc
Classification:	Application
Component:	code
Version:	OOo 2.0.2
Hardware:	All All

Importance:	P3 Trivial with 1 vote
Target Milestone:	---
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-06-28 11:34 UTC by discoleo
Modified:	2013-08-07 15:12 UTC
CC List:	2 users

See Also:
Issue Type:	ENHANCEMENT
Latest Confirmation in:	---
Developer Difficulty:	---

Description discoleo 2006-06-28 11:34:27 UTC

GAWK
----

There are many excellent programs that perform specific tasks extremely well.
OOo cannot surpass their functionality, and it does not need to. I therefore
recommend integrating them, i.e. making them easy to use from within OOo.

Indeed, OOo should build on existing free software, too. I already reported in
issue http://www.openoffice.org/issues/show_bug.cgi?id=66589 the potential of
the free statistical R-package. I will elaborate here on another excellent
program, gawk.

CLARIFICATION: when I say tight integration, I do NOT mean to use the code from
that program inside OOo. Instead, OOo should be able to communicate (e.g.
through pipelines) directly with that program and the user should be able to use
that program within OOo, without having to establish the connection himself.

WHY GAWK?
Gawk is free software available for both UNIX and MS Windows (project GnuWin32
on Sourceforge.net). It allows one to automate complex text-manipulation tasks
beyond the possibilities of ordinary scripts (like VB or JavaScript). UNIX users
would be particularly happy to use it.

There are 2 potential uses for gawk in OOo:

1.) the "Cells/Rows" architecture in Calc is obviously inviting for gawk use.

I often need to perform complex string manipulations, but Calc offers very
limited possibilities here. In addition, some bugs with the find function (see
my issue http://qa.openoffice.org/issues/show_bug.cgi?id=66590 ) complicate this
further. Complex computations written inside a Calc worksheet cannot be
automated (reused in a different worksheet), while the classical macros are
really limited in scope. This is where gawk makes the difference.

How to implement this?
 - Calc should open a bi-directional pipeline to gawk (gawk supports this)
 - the gawk script should be written and saved as a macro/script (therefore one
additional level of automation)
 - IF no FieldSeparator (FS) or RecordSeparator (RS) is specified in the
BEGIN-section of the gawk script, Calc should set some default values, which
should also be used to join the Cells and Rows of the worksheet when
pipelining the data stream into gawk
 - these same values (FS & RS) should be used to split the data back into cells
when importing the processed data back into Calc (through the bi-di pipeline).
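
The round trip described in the list above can be approximated outside Calc
with an ordinary shell pipe. This is only a sketch under stated assumptions:
tab as the default FS, newline as the default RS, and a hypothetical helper
named process_rows standing in for the pipe that Calc would open. None of this
is an existing OOo interface. It uses only standard awk features, so plain awk
works as well as gawk.

```shell
#!/bin/sh
# Sketch only: emulates the proposed Calc <-> gawk round trip with a pipe.
# Assumptions (not an OOo API): cells are joined with a tab (FS), rows with
# a newline (RS); process_rows is a hypothetical helper name.

process_rows() {
    # -F sets the input field separator; OFS is set to the same value so
    # the output can be split back into cells with the same delimiter.
    awk -F '\t' 'BEGIN { OFS = "\t" }
    {
        n = NF                # number of cells in this row
        $2 = toupper($2)      # example transformation on the second cell
        $(n + 1) = n          # append a new cell holding the old cell count
        print
    }'
}

# Two "rows" of three "cells" each, serialized the way Calc might do it.
result=$(printf 'a\tb\tc\nd\te\tf\n' | process_rows)
printf '%s\n' "$result"
```

Because the same separator is used in both directions, the caller can split
each returned line on tabs to get the cells of one row back.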

2.) the second use of gawk is obviously in Writer.

The advantages of gawk are again versatility and suitability for complex tasks
and automation, but also its speed.

The implementation should be similar to that described previously, with the
exception that RS should delimit paragraphs while FS should be left default
(=space).
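
For plain text, awk already has a mode close to this: an empty RS makes every
blank-line-separated block one record, and the default FS splits it into words.
A minimal sketch, assuming Writer would hand over its paragraphs as plain text
separated by blank lines (that export format and the helper name count_words
are assumptions for illustration, not existing OOo features):

```shell
#!/bin/sh
# Sketch only: paragraph-wise processing with awk's paragraph mode.
# Assumption: paragraphs arrive as plain text separated by blank lines.

count_words() {
    # RS = "" makes each blank-line-separated block one record; the
    # default FS then splits the record into words on any whitespace.
    awk 'BEGIN { RS = "" } { print "paragraph " NR ": " NF " words" }'
}

summary=$(printf 'First paragraph here.\n\nSecond one\nspans two lines.\n' | count_words)
printf '%s\n' "$summary"
```

If Writer instead used a single newline per paragraph, the default RS would
already treat each paragraph as one record and the BEGIN block could be dropped.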

An advanced feature would be to implement 2 modes for Writer to parse the text:
 - as plain text (no formatting, just splitting into paragraphs)
 - as xml-tagged text for more advanced processing (include text
styles/formatting, but not as comprehensive as in the saved file)

Comment 1 frank 2006-06-30 10:35:30 UTC

As enhancement re-assigned to requirements

Comment 2 discoleo 2006-06-30 11:22:57 UTC

EXAMPLE OF GAWK USE

Here is a real-life example showing the usefulness of gawk/awk:

I worked recently on a patient DB and wanted to create some dummy variables for
the hospital unit (patient category).

GAWK SCRIPT
# $1 contains the input - the hospital unit

{
    $0 = tolower($0)   # must come first: assigning $0 re-splits the fields
    $2 = 0 # neurosurgery vs non-neurosurgery
    $3 = 0 # neurology vs non-neurology
    $4 = 0 # general surgery vs non-surgery
    $5 = 0 # internal medicine vs non-im
    $6 = 1 # ERROR var, if unknown abbreviation
}

# NEUROSURGERY
$1 ~ /nch/ { $2 = 1; $6 = 0 }

# Neurology
$1 ~ /^n$|^ne/ { $3 = 1; $6 = 0 }

# General Surgery
$1 ~ /^ch/ { $4 = 1; $6 = 0 }

# INTERNAL MEDICINE
$1 ~ /mi|end|nut/ { $5 = 1; $6 = 0 }

# append the unit and its dummy variables to the output file
{ print $0 >> "out-file" }

### END SCRIPT

 - this simple script does exactly what I wanted in a very simple fashion,
AND
 - it took me less than 5 minutes to write it!!!

Execution is almost instant, even on big files (~1 MB text file).

Unfortunately, I didn't manage to get this same thing done using only Calc's
functionality. (One reason is the problem with the find() function described
previously; this severe limitation of Calc hampers any serious work with strings.)