LibreOffice Calc Spreadsheets on the GPU Michael Meeks <michael.meeks@collabora.com> mmeeks, #libreoffice-dev, irc.freenode.net “Stand at the crossroads and look; ask for the ancient paths, ask where the good way is, and walk in it, and you will find rest for your souls...” - Jeremiah 6:16
Overview
● LibreOffice ?
● A bit about:
● GPUs …
● Spreadsheets
● Internal re-factoring
● OpenCL optimisation
● new calc features
● XML / load performance
● Calc / GPU questions ?
● Questions ?
LibreOffice Project & Software
• Open Source / Free Software
• One million new unique IPs per week (that we can track)
• Double the weekly growth one year ago.
• Tens of millions of users, and growing fast.
• Hundred+ contributing coders each month
• 2500+ commits last month
• Around a thousand developers (including QA, Translators, UX etc.)
[Chart: Cumulative unique IPs for updates vs. time, not counting any Linux / vendor versions; Y axis from 0 to 60,000,000]
http://www.libreoffice.org/
Advisory Board Members This slide's layout is a victim of our success here ... 4 / 41 Event Name | Your Name
Why use the GPU ?
APUs – GPU faster than CPU
● Tons of un-used Compute Units across your APU
● Double precision is un-reasonably slower
● And precision is non-negotiable for spreadsheets: IEEE 754 required.
● Better power usage per flop.
[Chart: fp64 and fp32 flops for CPU, GPU and FirePro 7990. Numbers based on a Kaveri 7850K APU & a top-end discrete Graphics card. Flops: note the log scale, 1 to 10000 ...]
Developers behind the calc re-work:
Kohei Yoshida: MDDS maintainer, Heroic calc core re-factorer, Code Ninja etc.
Markus Mohrhard: Calc maintainer, Chart2 wrestler, Unit tester par Excellence etc.
Matus Kukan: Data Streamer, G-builder, Size optimizer ..
Jagan Lokanatha
Kismat Singh
A large OpenCL team, particularly I-Jui (Ray) Sung
Spreadsheet Geometry
An early Spreadsheet, c. 3000 BC: Aspect ratio: 8:1. Contents: Victory against every land … who giveth all life forever …
Excel 2003: 64k x 256. Aspect: 256:1
Excel 2010: 10^6 x 16k. Aspect: 64:1
The 'Broom Handle' aspect ratio.
50% of spreadsheets used to make business decisions.
Columnar data structures
Spreadsheet Core Data Storage
The joy of Object Orientation
[Diagram: ScDocument → ScTable → ScColumn → ScBaseCell]
ScBaseCell: Broadcaster (8 bytes), Text width (2 bytes), Cell type (1 byte), Script type (1 byte)
Derived from ScBaseCell: ScValueCell, ScFormulaCell, ScStringCell, ScEditCell, ScNoteCell*
Abstraction of Cell Value Access
ScBaseCell Usage (Before)
[Diagram: ScBaseCell used directly by all of the following]
ScDocument, Document Iterators, Undo / Redo, Change Tracking, Content Rendering, External Reference, Conditional Formatting, Cell Validation, Chart Data Provider, UNO API Layer, VBA API Layer, CppUnit Test
Filters: RTF, Quattro Pro, HTML, Excel (xls, xlsx), CSV, DIF, SYLK, DBF, ODF
Abstraction of Cell Value Access
ScBaseCell Usage (After)
[Diagram: only ScDocument and Document Iterators remain]
Biggest calc core re-factor in a decade+
Dis-infecting the horrible, long-term, inherited structural problems of Calc.
Lots of new unit tests being created for the first time for the calc core.
Moved to using new 'MDDS' data structures.
2x weeks with no compile ...
Before (ScBaseCell)
Scattered pointer chasing walking cells down a column ...
[Diagram: ScDocument → ScTable → ScColumn → ScBaseCell]
ScBaseCell: Broadcaster (8 bytes), Text width (2 bytes), Cell type (1 byte), Script type (1 byte)
Derived from ScBaseCell: ScValueCell, ScFormulaCell, ScStringCell, ScEditCell, ScNoteCell*
After (mdds::multi_type_vector)
[Diagram: ScDocument → ScTable → ScColumn]
Per-column arrays: Broadcasters, Cell notes, Text widths, Script types, Cell values
Cell values are stored in typed blocks: svl::SharedString block, double block, EditTextObject block, ScFormulaCell block
Iterating over cells (old way)
… loop down a column … and the inner loop:
    double nSum = 0.0;
    ScBaseCell* pCell = pCol->maItems[nColRow].pCell;
    ++nColRow;
    switch (pCell->GetCellType())
    {
        case CELLTYPE_VALUE:
            nSum += ((ScValueCell*)pCell)->GetValue();
            break;
        case CELLTYPE_FORMULA:
            … something worse ...
        case CELLTYPE_STRING:
        case CELLTYPE_EDIT:
            …
        case CELLTYPE_NOTE:
            …
    }
Iterating over cells (new way)
    double nSum = 0.0;
    for (size_t i = 0; i < nChunkLength; i++)
        nSum += pDoubleChunk[i];
ONO. from a vectoriser ...
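The block layout and the tight loop can be sketched together. This is only an illustration under stated assumptions: Column, Block, DoubleBlock, StringBlock, EmptyBlock and sumColumn are hypothetical names, and the sketch does not use the real mdds::multi_type_vector interface. It shows one type check per block followed by a branch-free inner loop over contiguous doubles.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for a column stored as runs of same-typed cells.
// This is NOT the mdds::multi_type_vector API, only the idea behind it.
using DoubleBlock = std::vector<double>;
using StringBlock = std::vector<std::string>;
struct EmptyBlock { std::size_t length; };
using Block = std::variant<EmptyBlock, DoubleBlock, StringBlock>;

struct Column {
    std::vector<Block> blocks; // blocks in row order
};

// Sum all numeric cells: the cell type is tested once per block,
// not once per cell as in the old switch statement.
double sumColumn(const Column& col) {
    double nSum = 0.0;
    for (const Block& block : col.blocks) {
        if (const DoubleBlock* pChunk = std::get_if<DoubleBlock>(&block)) {
            const double* pDoubleChunk = pChunk->data();
            const std::size_t nChunkLength = pChunk->size();
            for (std::size_t i = 0; i < nChunkLength; i++)
                nSum += pDoubleChunk[i];
        }
        // String and empty blocks contribute nothing to the sum.
    }
    return nSum;
}
```

The inner loop has no per-cell branch or pointer dereference, which is the shape a compiler's vectoriser (or a GPU kernel) can work with.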
Shared Formula
Before
[Diagram: every ScFormulaCell owns its own ScTokenArray (… Tokens … RPN ...), repeated for each cell in the column]
After
[Diagram: many ScFormulaCell instances point to one ScFormulaCellGroup, which holds a single shared ScTokenArray (… Tokens … RPN)]
Memory usage
Heap memory size (MB):
Empty document: 27
Shared formula on: 259
Shared formula off: 372
Test document used: http://kohei.us/wp-content/uploads/2013/08/shared-formula-memory-test.ods
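The grouping idea can be sketched in a few lines. This is a simplified illustration, not Calc's actual classes: TokenArray, FormulaCellGroup, FormulaCell and makeSharedRun are hypothetical stand-ins for ScTokenArray, ScFormulaCellGroup and ScFormulaCell, and the real code carries far more state.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for ScTokenArray.
struct TokenArray {
    std::vector<std::string> tokens; // parsed formula tokens
    std::vector<std::string> rpn;    // the same formula in RPN order
};

// Hypothetical stand-in for ScFormulaCellGroup: one token array per run.
struct FormulaCellGroup {
    TokenArray code;
    std::size_t startRow = 0;
    std::size_t length = 0;
};

// Hypothetical stand-in for ScFormulaCell: holds a reference, not a copy.
struct FormulaCell {
    std::shared_ptr<FormulaCellGroup> group;
    std::size_t row;
};

// Build a run of cells that all share one group (and so one token array).
std::vector<FormulaCell> makeSharedRun(const TokenArray& code,
                                       std::size_t startRow,
                                       std::size_t length) {
    auto group = std::make_shared<FormulaCellGroup>();
    group->code = code;
    group->startRow = startRow;
    group->length = length;

    std::vector<FormulaCell> cells;
    cells.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
        cells.push_back(FormulaCell{group, startRow + i});
    return cells;
}
```

With this shape, a column of a thousand identical formulas stores one token array instead of a thousand, and a whole group is a natural unit to hand to a single kernel.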
Shared string re-work
● String comparisons were slow
● Also not tractable for a GPU
● Case-insensitive equality is a hard problem – ICU & heavy lifting.
● String comparisons are used a lot in functions and Pivot Tables.
● Shared string storage is useful.
● So fix it ...
Concept
[Diagram: svl::SharedStringPool holds an Original string pool and an Upcased string pool; each svl::SharedString refers to entries in both pools]
String comparison (old way)
String comparison (new way)
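A minimal sketch of the pooled-string concept, assuming a much simpler design than the real one: this SharedStringPool and SharedString are hypothetical and do not reproduce the svl:: interfaces, and plain ASCII upcasing stands in for the ICU-based case handling. The point it illustrates is that once both the original and the upcased form are interned, equality becomes a pointer comparison.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical handle: two pointers into the pools below.
struct SharedString {
    const std::string* original = nullptr; // entry in the original pool
    const std::string* upper = nullptr;    // entry in the upcased pool
};

// Hypothetical pool; NOT the real svl::SharedStringPool API.
class SharedStringPool {
    // unordered_set is node-based, so element addresses stay valid
    // while the pool grows.
    std::unordered_set<std::string> originals_;
    std::unordered_set<std::string> upcased_;

public:
    SharedString intern(const std::string& s) {
        std::string up = s;
        // Assumption: ASCII-only upcasing instead of ICU.
        std::transform(up.begin(), up.end(), up.begin(), [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
        SharedString result;
        result.original = &*originals_.insert(s).first;
        result.upper = &*upcased_.insert(std::move(up)).first;
        return result;
    }
};

// Both comparisons are single pointer compares (valid within one pool).
inline bool equalCaseSensitive(const SharedString& a, const SharedString& b) {
    return a.original == b.original;
}
inline bool equalIgnoreCase(const SharedString& a, const SharedString& b) {
    return a.upper == b.upper;
}
```

The expensive case conversion happens once, when a string enters the pool; comparing fixed-size identifiers is also something a GPU kernel can do, unlike a locale-aware string compare.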
OpenCL / calculation ...
Why OpenCL & HSA ...
● GPU and CPU optimisation …
● Why write custom SSE2/SSE3 etc. assembly, detect the architecture, and select a backend across platforms?
● Instead get OpenCL (from the APU vendor) to generate the best code ...
● Heterogeneous System Architecture rocks:
● An AMD64 like innovation:
● shared Virtual Memory Address space & pointers: CPU ↔ GPU.
● Avoid wasteful copies, fast dispatch
● Great OpenCL 2.0 support.
● Use the right Compute Unit for the job.
Auto-compile Formula → OpenCL
Formulae compiled idly / on entry in a thread … to hide latency.
Kernel generation thanks to:
    #pragma OPENCL EXTENSION cl_khr_fp64: enable
    int isNan(double a) { return isnan(a); }
    double legalize(double a, double b) { return isNan(a)?b:a;}
    double tmp0_0_fsum(__global double *tmp0_0_0) {
        double tmp = 0;
        {
            int i;
            i = 0;
            tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
            i = 1;
            tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
            i = 2;
            tmp = legalize(((tmp0_0_0[i])+(tmp)), tmp);
        } // to scope the int i declaration
        return tmp;
    }
    double tmp0_nop(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = tmp0_0_fsum(tmp0_0_0);
        return tmp;
    }
    __kernel void DynamicKernel_nop_fsum(__global double *result, __global double *tmp0_0_0) {
        int gid0 = get_global_id(0);
        result[gid0] = tmp0_nop(tmp0_0_0);
    }
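To make the formula-to-kernel step concrete, here is a small sketch of a source generator, assuming the simplest possible case: a SUM over a fixed number of cells, unrolled like the kernel on this slide. generateSumKernel and the shortened names fsum and DynamicKernel_fsum are hypothetical; the real generator in Calc walks the formula's token array and handles far more cases.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical generator: emits OpenCL C source for an unrolled sum over
// nCells doubles. It only builds a string; compiling and running the
// kernel would be done through the OpenCL runtime, which is not shown.
std::string generateSumKernel(std::size_t nCells) {
    std::ostringstream src;
    src << "#pragma OPENCL EXTENSION cl_khr_fp64: enable\n"
        // legalize: keep the previous value if the addition produced NaN
        << "double legalize(double a, double b) { return isnan(a) ? b : a; }\n"
        << "double fsum(__global double *in) {\n"
        << "    double tmp = 0;\n";
    // One unrolled statement per cell, as in the generated code on the slide.
    for (std::size_t i = 0; i < nCells; ++i)
        src << "    tmp = legalize(in[" << i << "] + tmp, tmp);\n";
    src << "    return tmp;\n"
        << "}\n"
        << "__kernel void DynamicKernel_fsum(__global double *result,\n"
        << "                                 __global double *in) {\n"
        << "    result[get_global_id(0)] = fsum(in);\n"
        << "}\n";
    return src.str();
}
```

Calling generateSumKernel(3) yields source with three unrolled additions. Because generation and compilation cost time, doing it in a background thread while the user is idle or typing, as the slide notes, keeps that latency off the calculation path.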
The same formula for a longer sum …
Compiled from standard formula syntax
    __kernel void tmp0_0_0_reduction(__global double* A, __global double *result, int arrayLength, int windowSize)
    {
        double tmp, current_result =0;
        int writePos = get_group_id(1);
        int lidx = get_local_id(0);
        __local double shm_buf[256];
        int offset = 0;
        int end = windowSize;
        end = min(end, arrayLength);
        barrier(CLK_LOCAL_MEM_FENCE);
        int loop = arrayLength/512 + 1;
        for (int l=0; l<loop; l++) {
            tmp = 0;
            int loopOffset = l*512;
            if((loopOffset + lidx + offset + 256) < end) {
                tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
                tmp = legalize(((A[loopOffset + lidx + offset + 256])+(tmp)), tmp);
            } else if ((loopOffset + lidx + offset) < end)
                tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
            shm_buf[lidx] = tmp;
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int i = 128; i >0; i/=2) {
                if (lidx < i)
                    shm_buf[lidx] = ((shm_buf[lidx])+(shm_buf[lidx + i]));
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (lidx == 0)
                current_result =((current_result)+(shm_buf[0]));
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lidx == 0)
            result[writePos] = current_result;
    }
    double tmp0_0_fsum(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = ((tmp0_0_0[gid0])+(tmp));
        return tmp;
    }
    double tmp0_nop(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = tmp0_0_fsum(tmp0_0_0);
        return tmp;
    }
    __kernel void DynamicKernel_nop_fsum(__global double *result, __global double *tmp0_0_0)
    {
        int gid0 = get_global_id(0);
        result[gid0] = tmp0_nop(tmp0_0_0);
    }
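The reduction kernel is easier to follow as a serial emulation of one work-group. The sketch below is an illustration in plain C++, not the generated code: reduceLikeWorkGroup and kGroupSize are hypothetical names, the 256 work-items run one after another in a loop, and each completed pass of a loop stands where the kernel has a barrier.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 256; // work-items per group, as shm_buf[256]

// Same role as the kernel's legalize(): ignore an addend that yields NaN.
double legalize(double a, double b) { return std::isnan(a) ? b : a; }

// Serial emulation of one work-group summing `data` (hypothetical helper).
double reduceLikeWorkGroup(const std::vector<double>& data) {
    std::array<double, kGroupSize> shm{}; // stands in for __local memory
    double result = 0.0;
    const std::size_t end = data.size();

    // Each outer step covers 512 elements, like loopOffset = l*512.
    for (std::size_t base = 0; base < end; base += 2 * kGroupSize) {
        // Phase 1: every "work-item" loads up to two elements, 256 apart.
        for (std::size_t lidx = 0; lidx < kGroupSize; ++lidx) {
            double tmp = 0.0;
            if (base + lidx < end)
                tmp = legalize(data[base + lidx] + tmp, tmp);
            if (base + lidx + kGroupSize < end)
                tmp = legalize(data[base + lidx + kGroupSize] + tmp, tmp);
            shm[lidx] = tmp;
        }
        // Phase 2: tree reduction; the stride halves on every pass
        // (128, 64, ..., 1), so 256 partial sums collapse in 8 passes.
        for (std::size_t stride = kGroupSize / 2; stride > 0; stride /= 2)
            for (std::size_t lidx = 0; lidx < stride; ++lidx)
                shm[lidx] += shm[lidx + stride];
        // Work-item 0 accumulates the group's partial sum.
        result += shm[0];
    }
    return result;
}
```

On a GPU the 256 iterations of each inner loop run in parallel, so a 512-element slice costs one load phase plus eight reduction passes rather than 512 sequential additions.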
Performance numbers for sample sheets.
30x → 500x faster for these samples vs. the legacy software calculation on Kaveri.
[Chart: GPU / OpenCL vs. Software calculation time for the sample sheets min_max_avg_r, destination-workbook, dates-worked, stock-history and ground-water. Shorter is better. Yet another log plot … milliseconds on the X axis, 1 to 100,000 ...]