This is a continuation of my blog
A Way of Reading Huge Excel Files in which I described the usage of the sXML parser for an alternative Excel file reader with a massively reduced memory footprint. I followed the suggestion of some of my blog readers (
kay.streubelrainer.hbenthal) to add the code to the abap2xlsx repository at
ivanfemia/abap2xlsx · GitHub. The abap2xlsx package now contains a subclass ZCL_EXCEL_READER_HUGE_FILE of the existing reader class ZCL_EXCEL_READER_2007.
I owe special thanks to ivan.femia for helping me with the first steps into the abap2xlsx realm. 🙂
Only the sheet data and shared string table are parsed differently in ZCL_EXCEL_READER_HUGE_FILE, as these usually are the heavy-load parts of an Excel file. The other parts - like style details - continue to be analyzed in the superclass and are therefore parsed with iXML parsers, as before.
How to Use It
The public interface - and therefore the usage - is the same as for ZCL_EXCEL_READER_2007. If you have an XSTRING with the .xlsx file content, you work with the LOAD( ) method; if you have a filename and want abap2xlsx to read it (from application or presentation server), you would use LOAD_FILE( ).
The test report ZZ_TEST_EXCEL_READER (code below) shows the idea.
It's Just Another Object Type
You can use the report to load an excel file from the presentation or application server, doing the parsing with the "standard" reader or with the new "huge file reader". As you see, the difference between the two reader versions is only in the CREATE OBJECT lo_reader TYPE ... statement.
Decision to Read Files From Presentation Server or Application Server
The possibility to decide between application server and presentation server explicitly is a new feature. Formerly, this distinction was made automatically, depending on the SY-BATCH flag. This will still be the default value for the new optional import parameter I_FROM_APPLSERVER. But now, the distinction can also be made by the caller.
It's good practice to have the reader as a local object in an own, separate code unit (here: a form routine). This way, all the memory that was necessary during the read process will be freed when the excel object has been built and the reading unit is left.
report zz_test_excel_reader.
parameters:
p_file type string lower case default `C:\temp\test.xlsx`,
p_appl type flag,
p_huge type flag.
start-of-selection.
perform start.
* ---
form start.
data: lo_excel type ref to zcl_excel,
lo_excpt type ref to zcx_excel.
try.
perform read using p_file p_appl p_huge
changing lo_excel.
break-point. " <<<--- Inspect lo_excel here
* Parsed data are usually contained in the following ITAB:
* LO_EXCEL->WORKSHEETS->WORKSHEETS->COLLECTION[1]->SHEET_CONTENT
catch zcx_excel into lo_excpt.
message lo_excpt type 'I'.
endtry.
endform. "start
* ---
form read using i_filename type csequence
i_from_applserver type flag
i_huge type flag
changing e_excel type ref to zcl_excel
raising zcx_excel.
* Use the reader instance as a local variable in a separate unit,
* so its memory will be released after leaving the unit.
data: lo_reader type ref to zif_excel_reader.
if i_huge eq 'X'.
create object lo_reader type zcl_excel_reader_huge_file.
else.
create object lo_reader type zcl_excel_reader_2007.
endif.
e_excel = lo_reader->load_file( i_filename = i_filename
i_from_applserver = i_from_applserver ).
endform. "read
The Main Loop
An Excel workbook is saved as a zip archive containing several XML files. The files
sheet1.xml,
sheet2.xml etc. contain the actual worksheet data, among other things like style definitions. Here is a typical example of a sheet.xml: As you see, the data start with a separate element named sheetData; in this example, the cells contain pointers to the shared string table (which is defined in another file
sharedStrings.xml).
The parser (we use "token-based sXML parsing", see
ABAP Keyword Documentation) for this file can safely skip forward until it meets the opening of element
sheetData. Once this is reached, it should read elements and their contents one after another, as follows:
- When an opening c element is detected, a new cell data structure has to be filled from the element c's attribute values;
- When a text content of an element is detected, this content should usually be treated as value of the current cell
- in some cases, however, the value has to be fetched by index from the shared string table
- When a closing c element is detected, the current cell data should be added into the ABAP worksheet representation
- When a closing sheetData element is detected, the loop should be left.
And this is the above text written in ABAP:
method read_worksheet_data.
data: ls_cell type t_cell.
* Skip to <sheetData> element
skip_to( iv_element_name = `sheetData` io_reader = io_reader ).
* Main loop: Evaluate the <c> elements and its children
while io_reader->node_type ne c_end_of_stream.
io_reader->next_node( ).
case io_reader->node_type.
when c_element_open.
if io_reader->name eq `c`.
ls_cell = fill_cell_from_attributes( io_reader ).
endif.
when c_node_value.
case io_reader->name.
when `f`.
ls_cell-formula = io_reader->value.
when `v`.
if ls_cell-datatype eq `s`.
ls_cell-value = get_shared_string( io_reader->value ).
else.
ls_cell-value = io_reader->value.
endif.
when `t` or `is`.
ls_cell-value = io_reader->value.
endcase.
when c_element_close.
case io_reader->name.
when `c`.
put_cell_to_worksheet( is_cell = ls_cell io_worksheet = io_worksheet ).
when `sheetData`.
exit.
endcase.
endcase.
endwhile.
endmethod.
A caveat for the sXML parser is necessary: the instance attributes like
name,
value, etc. are never initialized. The caller has to take care which values are outdated, by looking at the three events "element open", "element close" and "node value".
When using the sXML reader, special caution is necessary for elements with mixed content. In the example,
<x><y>v</y>w</x>
the following events will be raised:
Event co_nt_element_open name: x value:
Event co_nt_element_open name: y value:
Event co_nt_value name: y value: v
Event co_nt_element_close name: y value: v
Event co_nt_value name: y value: w
Event co_nt_element_close name: x value: w
Event co_nt_final name: x value: w
Observe that when the event for value '
w' is raised, the name attribute still contains '
y', although
y has already been closed and thus is an outdated value: The sXML parser doesn't keep track of this outdating - if it would, it had to keep track of the document structure (for example by maintaining a stack of the parents of the current node). In the sXML philosophy, however, all things regarding knowledge of the document structure, are left to the caller - the reading process is restricted to the detection of opening and closing elements, of attribute values, text nodes etc.
A parser is an ideal candidate for unit tests, since its output depends only on the input stream - there are usually no dependencies on other states (database, API calls, ...). So I implemented all code of the reader using unit tests.
Testing Private Methods
The huge file reader has no public methods on its own - the only public methods are inherited from ZCL_EXCEL_READER_2007, which is out of the scope of my tests. For this reason, I decided to test private methods. There is a price for this decision: We are not as free as it could be in renaming and deleting private methods, since all these changes also require the test classes to be adapted.
To make private methods testable, we start the unit test section with a declaration of friendship:
class lcl_test definition deferred.
class zcl_excel_reader_huge_file definition local friends lcl_test.
Reading a Number
With some help methods in the test class, I let the code of a test look more intentional and more to-the-point: Everybody can read from the code of the following test method that the cell datatype "number" is under test: A cell with numerical content (17) should propagate the worksheet in the expected way:
**
method test_number.
data lo_reader type ref to if_sxml_reader.
lo_reader = get_reader(
`<c r="A1" t="n"><v>17</v></c>`
).
out->read_worksheet_data( io_reader = lo_reader io_worksheet = worksheet ).
assert_value_equals( `17` ).
assert_datatype_equals( `n` ).
endmethod.
Here,
OUT is the "object under test". As usual for unit tests, a new reader instance is created for each test method in the setup method. The same holds for the worksheet instance here, which holds the result.
Many tests just call the private central method
read_worksheet_data( ) and then inspect the produced worksheet instance. This method imports an instance of an sXML reader to read the input data, and an instance of ZCL_EXCEL_WORKSHEET to write the result. Apart from the expected changes in the worksheet's state, there are no side effects on other data. And the only further external dependency is that the table of shared strings (supposed to be parsed earlier from the
sharedStings.xml file) should be available already in the private attribute
shared_strings, a simple list of strings (of type STRINGTAB).
In order to restrict the code of the test methods to the essential, specific parts, the
get_reader( ) method only receives the XML code snippet for the cell to be parsed (or multiple cells in a row): before creating the reader instance, this nucleus is wrapped into a document skeleton, to emulate a full XML document:
*
method get_reader.
data: lv_full type string.
concatenate `<root><sheetData><row>` iv_xml `</row></sheetData></root>` into lv_full.
eo_reader = cl_sxml_string_reader=>create( cl_abap_codepage=>convert_to( lv_full ) ).
endmethod. "get_reader
On Shared Strings and the Boundary Values Problem
As a local friend, the test class is allowed to pre-fill the shared_strings table attribute. Therefore, we can pretend to have already parsed the sharedStrings.xml file by adding some values to the shared_strings table. Afterwards, we call the reader and check that a cell with a string reference is correctly resolved. Here is the test for correct reading of the string table:
*
method test_shared_string.
data lo_reader type ref to if_sxml_reader.
append `Test1` to out->shared_strings.
append `Test2` to out->shared_strings.
lo_reader = get_reader(
`<c r="A1" t="s"><v>1</v></c>`
).
out->read_worksheet_data( io_reader = lo_reader io_worksheet = worksheet ).
assert_value_equals( `Test2` ).
assert_datatype_equals( `s` ).
endmethod. "test_shared_string
For shared strings, the content of the <v> element is the 0-based index for reading the string table. Therefore, the cell content in this example has to be 'Test2' after parsing.
In a former version, I had been too minimalistic: I used only one entry in the string table, and the <v> element contained 0, not 1, to point to this one entry.
The test in its former version passed, but it didn't detect a bug in the code, due to the "boundary value" anti-pattern: 0 is a special value, since it also is the initial value for integer variables. The bug was that the index was not transferred at all to the variable holding the index, which thus remained on its initial value, zero. Only by extending the test with a second string and working with the non-special index value 1 instead, I could reproduce the bug - and then fix it.
Test Coverage
While working on the code and testing it, it's good to have a look on the test coverage. It shows possible uncovered coding extracts in which a bug might be hidden.
Of course, some parts are intentionally uncovered. In this class, the methods
load_shared_strings( ),
load_worksheet( ) and
get_sxml_reader( ) have a zero coverage on purpose, since they provide data from outside. I only test how the class works on these data. But the rest of the methods have a 100% statement coverage.
The
branch coverage (a column which is hidden by default in the coverage analyzer's default layout) shows that some conditions could still be added. Indeed, I have no particular tests for invalid
sheet.xml documents - for example, I always expect that there is the
sheetData element, as the
Open XML specification requires.
As the boundary value problem above shows, full statement coverage - or even full branch coverage - is not sufficient for detecting all bugs sleeping in the code. An instinct for potential bugs is necessary while writing the test method. It's an interesting change of hat: when writing test methods, we are supposed to turn into the most fervent critics of ourselves!