Assessing Migration Risk for Scientific Formats

Chris Frisz, Sam Waggoner, and Geoffrey Brown, Indiana University Bloomington. Presented 7 December 2011.


  5. Formats – Lotus 1-2-3 – Example: In Lotus 1-2-3, -4^2 = -16. In Excel, -4^2 = 16. Traditional mathematical order of operations favors Lotus.
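The two results differ only in whether negation or exponentiation binds more tightly. As a minimal sketch, here are the two groupings written out explicitly in Python. Python is used purely for illustration and is not one of the tools from the talk, and both function names are hypothetical.

```python
def negate_after_power(base, exponent):
    """Exponentiation binds tighter: -(base ^ exponent), the Lotus 1-2-3 reading."""
    return -(base ** exponent)


def negate_before_power(base, exponent):
    """Negation binds tighter: (-base) ^ exponent, the Excel reading."""
    return (-base) ** exponent


print(negate_after_power(4, 2))   # -16
print(negate_before_power(4, 2))  # 16
```

A formula that combines negation and exponentiation can therefore change value silently on migration, which is why such formulas are worth flagging.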

  8. Formats – Lotus 1-2-3 – Conversion issues (cont.): Comparison and logical operators (e.g. = or #and#) and string concatenation (&) also differ in order of operations. Lotus 1-2-3 evaluated comparison and logical operators first, whereas Excel evaluated concatenation first.

  14. Formats – Lotus 1-2-3 – Conversion Issues – Example: In Lotus 1-2-3, "Fo"&"o" = "Foo" → False. In Excel, "Fo"&"o" = "Foo" → True.
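The same formula text yields different answers because the two programs group the operators differently. The sketch below spells out both groupings in Python, with hypothetical function names. It evaluates only the inner comparison for the Lotus-style grouping, because the slides do not describe how 1-2-3 then combines a string with a logical value.

```python
def concatenate_first(left, right, target):
    """Group as (left & right) = target, the order attributed to Excel."""
    return (left + right) == target


def compare_first(left, right, target):
    """Group as left & (right = target), the order attributed to Lotus 1-2-3.

    Only the inner comparison is evaluated; the final combination of a
    string with a logical value is deliberately not modeled.
    """
    inner = right == target
    return left, inner


print(concatenate_first("Fo", "o", "Foo"))  # True
print(compare_first("Fo", "o", "Foo"))      # ('Fo', False)
```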

  16. Formats – CDF and netCDF: CDF and netCDF are both file formats for multidimensional data, often used to represent image, climate, and elevation data.

  17. Formats – CDF/netCDF Layout: [Figure: records numbered 1, 2, 3, ... laid out as rows, with one column per variable (rVariable 1 through rVariable n). Image courtesy of NASA/Goddard Space Flight Center.]

  23. Formats – CDF/netCDF – Background: CDF was originally developed by NASA. NetCDF was developed later by NCAR, based on CDF. Both formats are still supported.

  26. Formats – CDF/netCDF – Background (cont.): Separate development allowed different features to evolve, but overall functionality remained similar. The primary conversion path between CDF and netCDF was NASA’s Data Translation Web Service (DTWS).

  32. Formats – CDF – Conversion Issues: Features present in CDF but not in netCDF: (a) a multi-file format for organizing variables into different files; (b) native-mode encoding for faster data access on particular system architectures; (c) an Epoch data type for high-resolution time data. The multi-file and native-mode differences were identified in the CDF documentation. The Epoch data type mismatch was discovered through DTWS source code review.
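A check for these three features might look like the sketch below. The per-file summary dictionary and its keys (`multi_file`, `encoding`, `variable_types`) are hypothetical stand-ins for whatever a real tool would read through the CDF library, and Python is used only for illustration.

```python
def cdf_migration_risks(info):
    """Return the CDF-only features found in a per-file summary.

    `info` is a hypothetical dictionary with keys `multi_file` (bool),
    `encoding` (str) and `variable_types` (list of type names).
    """
    risks = []
    if info.get("multi_file"):
        risks.append("multi-file format")
    if info.get("encoding") == "native":
        risks.append("native-mode encoding")
    # CDF_EPOCH and CDF_EPOCH16 are the CDF time types.
    if any(t.startswith("CDF_EPOCH") for t in info.get("variable_types", [])):
        risks.append("Epoch data type")
    return risks


example = {
    "multi_file": False,
    "encoding": "native",
    "variable_types": ["CDF_REAL4", "CDF_EPOCH"],
}
print(cdf_migration_risks(example))
# ['native-mode encoding', 'Epoch data type']
```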

  37. Formats – netCDF – Conversion Issues: Features present in netCDF but not in CDF: (a) descriptive named dimensions usable for data access; (b) support for up to 32 dimensions per variable (versus CDF’s 10). The named dimensions mismatch was documented in NASA’s CDF FAQ. The maximum dimension mismatch was discovered through netCDF API code review.
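The dimension-count mismatch reduces to a simple per-variable check. The sketch below takes the limit of 10 from the slide. The input mapping from variable names to dimension names is a hypothetical summary of a netCDF file, not the output of any particular API.

```python
CDF_MAX_DIMENSIONS = 10  # per-variable limit for CDF, as quoted in the slides


def variables_exceeding_cdf_limit(variables):
    """Return names of variables with more dimensions than CDF allows.

    `variables` is a hypothetical mapping of variable name to a tuple of
    dimension names, as might be collected from a netCDF file.
    """
    return sorted(
        name for name, dims in variables.items()
        if len(dims) > CDF_MAX_DIMENSIONS
    )


example = {
    "temperature": ("time", "lat", "lon"),
    "wide": tuple(f"d{i}" for i in range(11)),
}
print(variables_exceeding_cdf_limit(example))  # ['wide']
```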

  39. Formats – HDF: A hierarchical data format for relating and interacting with heterogeneous data sets. It is organized similarly to a Unix file system, with Vgroups acting like directories and Vdata like files.

  40. Formats – HDF layout: [Figure: an HDF file containing one example of each HDF data type, with panels labeled "8-bit, 24-bit and General Raster", "Multidimensional array", "Table", and "Group of HDF data structures". Image courtesy of the HDF Group.]

  43. Formats – HDF – Background: Developed by the National Center for Supercomputing Applications, with support provided by the HDF Group. The most recent version was HDF5.

  46. Formats – HDF – Background (cont.): Previous versions were backwards compatible. HDF5 drastically changed the data model and broke backwards compatibility. The HDF Group provided both a conversion API and an automatic conversion tool.

  55. Formats – HDF – Conversion Issues: Merging Vgroups with elements sharing the same name resulted in the renaming of one element; this was only relevant for manual conversion. Data objects shared between Vgroups were copied on conversion. Unnamed data objects were given default names. The HDF Group documented all of these issues for the HDF4-to-HDF5 conversion API and automated tool.
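Two of these issues, shared objects and unnamed objects, can be detected from Vgroup membership and object names alone. The sketch below assumes that information has already been extracted into plain Python mappings. The object identifiers and input shapes are hypothetical and do not reflect the HDF4 API.

```python
from collections import defaultdict


def shared_objects(vgroups):
    """Return objects that belong to more than one Vgroup.

    `vgroups` is a hypothetical mapping of Vgroup name to a list of
    object identifiers. Shared objects would be copied on conversion.
    """
    owners = defaultdict(list)
    for group, members in vgroups.items():
        for obj in dict.fromkeys(members):  # ignore repeats within a group
            owners[obj].append(group)
    return {obj: groups for obj, groups in owners.items() if len(groups) > 1}


def unnamed_objects(names):
    """Return identifiers of objects with no name (None or empty string)."""
    return sorted(obj for obj, name in names.items() if not name)


groups = {"A": [1, 2], "B": [2, 3]}
print(shared_objects(groups))                            # {2: ['A', 'B']}
print(unnamed_objects({1: "temp", 2: "", 3: None}))      # [2, 3]
```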

  58. Tools – Lotus 1-2-3: We wrote a C program to traverse 1-2-3 files and parse formulas. It identified the presence of @MOD, @VLOOKUP, or @HLOOKUP in formulas. The program also conservatively reported any formula containing both exponentiation and negation, or both logical/comparison operators and string concatenation.
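The conservative checks described above can be sketched as follows. This is not the authors' C program. It is a Python illustration that assumes each formula has already been decoded from the binary file into text, and the function name and operator pattern are assumptions. Like the approach described in the slide, it flags mere co-occurrence of operators rather than proving that they interact.

```python
import re

RISKY_FUNCTIONS = ("@MOD", "@VLOOKUP", "@HLOOKUP")
# Comparison operators plus the 1-2-3 logical operators #AND#, #OR#, #NOT#.
COMPARISON_OR_LOGICAL = re.compile(r"[=<>]|#AND#|#OR#|#NOT#")


def formula_risks(formula):
    """Return a list of migration risks for one formula given as text."""
    text = formula.upper()
    risks = [name for name in RISKY_FUNCTIONS if name + "(" in text]
    # Conservative: any '-' counts, even if it is really a subtraction.
    if "^" in text and "-" in text:
        risks.append("exponentiation with negation")
    if "&" in text and COMPARISON_OR_LOGICAL.search(text):
        risks.append("concatenation with comparison/logical operator")
    return risks


print(formula_risks("-A1^2"))             # ['exponentiation with negation']
print(formula_risks("@mod(A1,3)+B2"))     # ['@MOD']
print(formula_risks("+A1+B1"))            # []
```

Reporting conservatively trades false positives for the guarantee that no affected formula is missed.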

  60. Tools – Lotus 1-2-3 (cont.): The tool consisted of approximately 500 lines and processed our entire data set in less than 15 minutes.

  65. Tools – CDF and netCDF: We wrote a C program for each of CDF and netCDF. The CDF program consisted of 300 lines using the version 3.3.0 API from NASA. The netCDF program was 150 lines using the version 4.1.3 API from Unidata. The CDF tool processed the entire 61,000-file data set in 55 minutes, and the netCDF tool exhibited similar performance.

  69. Tools – HDF: Yet again, we wrote a C program, this time 900 lines using the 4.2.6 API from the HDF Group. This tool was longer because of the large number of HDF interfaces. It processed all HDF files in our data set within 1.5 minutes.
