e-Research Summer Hackfest

chaired by Roberto Barbera (University of Catania)
from 4 July 2016 to 15 July 2016 (Europe/Rome)
at Catania (Aula B, 1st floor)
Department of Physics and Astronomy - Via S. Sofia, 64 - 95123 Catania
Description

 

“Bring your science to the web and the web to your science”

 

Overview and objectives

The e-Research Summer Hackfest will be held at the Department of Physics and Astronomy of the University of Catania on July 4-15, 2016.

The event is co-sponsored by the Sci-GaIA, INDIGO-DataCloud, and COST ENeL projects.

The main objective of the event is to integrate scientific use cases, through the pervasive adoption of web technologies and standards, and to make them available to end users through Science Gateways (entities connected to the distributed computing, data and services of interest to the Community of Practice the end users belong to). The ultimate goal of the hackfest is to promote and foster open and reproducible research.

Topics

The following topics will be tackled during the e-Research Summer Hackfest:

  • Big Data analytics;

  • Distributed computing services;

  • Distributed storage services;

  • Programmable access to Open Data repositories;

  • Semantic federation of Open Access repositories;

  • User interfaces (web, desktop, mobile, etc.);

  • Workflows.

Tools and technologies

The following tools and technologies will be showcased at the e-Research Summer Hackfest and used to implement the proposed use cases:

Contact

For all questions you may have regarding the Sci-GaIA e-Research Summer Hackfest, please contact us by email at summer-school@sci-gaia.eu.

Support Email: summer-school@sci-gaia.eu
  • Monday, 4 July 2016
    • 08:30 - 08:50 Registration and badging of participants 20'
    • 08:50 - 09:00 Welcome address 10'
      The welcome address is given by Prof. Valerio Pirronello, Director of the Department of Physics and Astronomy of the University of Catania.
      Material: Video link
    • 09:00 - 09:30 The Sci-GaIA project and introduction to the hackfest 30'
      Speaker: Roberto Barbera (University of Catania)
      Material: Slides powerpoint file Video lecture link
    • 09:30 - 10:00 The INDIGO-DataCloud project 30'
      This presentation will be given remotely.
      Speaker: Giacinto Donvito (INFN)
      Material: Slides pdf file Video lecture link
    • 10:00 - 11:30 Day 1 - The FutureGateway framework
      This section first introduces the FutureGateway (FG) framework, describes each of its components and explains how they work together. The presentation also covers some security considerations related to Science Gateway membership handling, taking as a point of reference the baseline AAI mechanism provided by the standard FG installation and showing how to modify it to switch between already existing AAI mechanisms. The FutureGateway also provides a complete set of REST APIs to manage distributed computing resources, which will be briefly described.
      Conveners: Riccardo Bruno (INFN), Marco Fargetta (INFN)
      Material: Slides powerpoint file Video lecture - part 1 link Video lecture - part 2 link Video lecture - part 3 link Video lecture - part 4 link
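The REST APIs mentioned in the session above can be illustrated with a small sketch. The endpoint path, field names and application identifier below are assumptions in the style of the FutureGateway API, not details taken from this programme; check the FG documentation for the real interface.

```python
import json

# Hypothetical FG instance URL; a real deployment would differ.
FG_ENDPOINT = "http://localhost:8888/v1.0/tasks"

def build_task(application_id, arguments, description=""):
    """Assemble a task-creation payload in the assumed FG REST API shape."""
    return {
        "application": application_id,
        "arguments": list(arguments),
        "description": description,
        "output_files": [],
    }

payload = build_task("1", ["input.txt"], description="demo task")
body = json.dumps(payload)

# An actual submission would then be something like:
#   requests.post(FG_ENDPOINT, data=body,
#                 headers={"Authorization": "Bearer <token>",
#                          "Content-Type": "application/json"})
```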
    • 11:30 - 12:00 Coffee break
    • 12:00 - 13:00 The FutureGateway framework - continued
      This section shows how to set up a FutureGateway instance using several baseline installation scripts, and how to manage, maintain and update the system. The usage of several REST APIs will also be presented through real examples.
      Conveners: Riccardo Bruno (INFN), Marco Fargetta (INFN)
      Material: Slides powerpoint file Video lecture - part 5 link Video lecture - part 6 link Video lecture - part 7 link
    • 13:00 - 14:00 Lunch break
    • 14:00 - 15:30 The INDIGO PaaS
      In this section the PaaS Layer architecture of the INDIGO-DataCloud Project will be described focusing on the technologies and solutions adopted for each component. 
      The interaction among the various pieces will be shown through the description of the main scenarios: the deployment of “IaaS automated services”, the deployment of a “PaaS service”, i.e. a Long-Running service (such as DBMS) or a user application to be run with specific input/output data. 
      Convener: Marica Antonacci (INFN)
      Material: Slides pdf file Video lecture - part 1 link Video lecture - part 2 link Video lecture - part 3 link
    • 15:30 - 16:00 Coffee break
    • 16:00 - 17:00 The INDIGO PaaS - continued
      In this section some practical examples of INDIGO PaaS usage will be shown.
      The starting point will be the TOSCA template that describes the topology of the services to be deployed.
      We will then show how to submit the template to the INDIGO Orchestrator and how to monitor the deployment status.
      Different types of TOSCA templates will be demonstrated, e.g. a Galaxy template, a Mesos cluster, the execution of Chronos jobs, the deployment of a Marathon application, etc.
      Convener: Marica Antonacci (INFN)
      Material: Screencasts file Slides pdf file Video tutorial - part 1 link Video tutorial - part 2 link Video tutorial - part 3 link Video tutorial - part 4 link Video tutorial - part 5 link
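The template-then-submit flow described in the session above can be sketched as follows. The tiny TOSCA fragment is purely illustrative (node names are made up, the syntax follows the TOSCA Simple Profile in YAML style), and the JSON field names for the Orchestrator request are assumptions.

```python
import json

# Illustrative TOSCA template: a single compute node with modest resources.
TOSCA_TEMPLATE = """\
tosca_definitions_version: tosca_simple_yaml_1_0
topology_template:
  node_templates:
    my_server:
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties: {num_cpus: 1, mem_size: 2 GB}
"""

def build_deployment_request(template, parameters=None):
    """Wrap a TOSCA template in the JSON body assumed to be expected
    by the Orchestrator's deployment endpoint."""
    return {"template": template, "parameters": parameters or {}}

request_body = json.dumps(build_deployment_request(TOSCA_TEMPLATE))
# A real submission would then POST request_body to the Orchestrator,
# e.g. requests.post("https://orchestrator.example/deployments", ...),
# and poll the returned deployment resource to monitor its status.
```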
    • 17:00 - 18:30 The gLibrary framework 1h30'
      In this presentation, we introduce gLibrary 2.0, a platform to create REST APIs over existing databases or new datasets. It supports both relational and non-relational (i.e. schema-less) datasets, and provides data storage services on Grid and Cloud (OpenStack-based) storage servers. After a general overview of the architecture, we will show live how to create a new repository, import data collections from an existing database, create new collections from scratch, make queries, and use replicas/attachments to handle file transfers.
      Speaker: Antonio Calanducci (INFN)
      Material: Slides powerpoint file Tutorial link Video lecture link Video live demo link
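Querying a REST API over a dataset, as gLibrary does, amounts to composing filtered collection URLs. The base URL, path layout, collection names and the LoopBack-style `filter` parameter below are assumptions for illustration, not gLibrary's documented API.

```python
import json
from urllib.parse import urlencode

BASE = "https://glibrary.example/v1"  # hypothetical instance

def collection_query_url(repo, collection, where=None, limit=None):
    """Build a filtered query URL for an assumed gLibrary-style
    collection endpoint."""
    filt = {}
    if where:
        filt["where"] = where
    if limit:
        filt["limit"] = limit
    query = urlencode({"filter": json.dumps(filt)}) if filt else ""
    url = f"{BASE}/{repo}/{collection}"
    return f"{url}?{query}" if query else url

# Hypothetical repository and collection names:
url = collection_query_url("medical", "mammograms",
                           where={"type": "DICOM"}, limit=10)
```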
  • Tuesday, 5 July 2016
    • 08:30 - 09:00 Registration and badging of participants 30'
    • 09:00 - 10:00 Programmatic interaction with Open Access Repositories
      In this section we will introduce the concept of a Digital Asset Management System and discuss programmatic interaction with Open Access Repositories (based on Invenio). Then, we will show how to submit different types of resources manually through the repository. After that, we will start with the programmatic interaction with an Open Access Repository through the use of APIs for data searching, downloading and uploading. We will have a brief look at the MARCXML tags. At the end, we will see how to interact with the Open Access Repository using the OAI-PMH standard protocol and how to provide authorship to research products stored on an Open Access Repository.
      Conveners: Roberto Barbera (University of Catania), Carla Carrubba (University of Catania)
      Material: Slides powerpoint file Video lecture - part 1 link Video lecture - part 2 link Video lecture - part 3 link XML example file to upload contents on the OAR
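OAI-PMH, mentioned in the session above, is a plain HTTP+XML harvesting protocol: a request is a GET with a `verb` parameter, and the response is XML. The sketch below builds a `ListRecords` request URL and extracts Dublin Core titles from a hand-written sample response (the repository URL is a placeholder).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def listrecords_url(base_url, metadata_prefix="oai_dc"):
    """Compose an OAI-PMH ListRecords request URL."""
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})

DC = "{http://purl.org/dc/elements/1.1/}"

def extract_titles(xml_text):
    """Pull Dublin Core titles out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC + "title")]

# Minimal, hand-written response fragment used here instead of a live call:
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example research product</dc:title>
    </oai_dc:dc>
  </metadata></record></ListRecords>
</OAI-PMH>"""

titles = extract_titles(sample)
```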
    • 10:00 - 11:00 The Onedata platform
      In this session we will give an overview of Onedata concepts such as spaces, user groups and providers. We will then discuss the Onedata system's internal architecture with a focus on scalability, fault tolerance and remote data access. Onedata's implementation of the CDMI protocol will be briefly discussed, along with its features for metadata management.
      Conveners: Krzysztof Trzepla (CYFRONET), Konrad Zemek (CYFRONET)
      Material: Slides powerpoint file pdf file Video lecture - part 1 link Video lecture - part 2 link Video lecture - part 3 link Video lecture - part 4 link
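CDMI, mentioned in the session above, is SNIA's HTTP-based Cloud Data Management Interface: data objects are read and written with ordinary REST calls carrying CDMI-specific headers and media types. A minimal sketch of the headers a GET on a data object would carry (the provider URL in the comment is a placeholder; 1.1.1 is one of the published spec versions):

```python
def cdmi_object_headers(version="1.1.1"):
    """Headers for a CDMI data-object request."""
    return {
        "X-CDMI-Specification-Version": version,
        "Accept": "application/cdmi-object",
    }

headers = cdmi_object_headers()
# A real read through a Onedata provider's CDMI endpoint would then be e.g.:
#   requests.get("https://provider.example/cdmi/my_space/file.txt",
#                headers=headers)
```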
    • 11:00 - 11:30 Coffee break
    • 11:30 - 13:00 The Onedata platform - continued
      To conclude the session we will broadly present some of our plans for the near future, focusing on Open Data integration in the system. During the presentation we will hold a live demo of Onedata, followed by a hands-on session for the audience.
      Conveners: Krzysztof Trzepla (CYFRONET), Konrad Zemek (CYFRONET)
    • 13:00 - 14:00 Lunch break
    • 14:00 - 16:00 The Ophidia platform
      The Ophidia project is a research effort on big data analytics addressing scientific data analysis challenges in the climate change domain. Ophidia provides declarative, server-side, and parallel data analysis, jointly with an internal storage model able to efficiently deal with multidimensional data and a hierarchical data organization to manage large data volumes (“datacubes”). The project relies on a strong background in high-performance database management and OLAP systems to manage large scientific datasets. The Ophidia analytics platform provides several data operators to manipulate datacubes, and array-based primitives to perform data analysis on large scientific data arrays. Metadata management support is also provided. The server front-end exposes several interfaces to address interoperability requirements: WS-I+, GSI/VOMS and OGC-WPS (through PyWPS). From a programmatic point of view, a Python module (PyOphidia) makes the integration of Ophidia into Python-based environments and applications (e.g. IPython) straightforward. The system also offers a bash-like CLI with a complete set of commands. A key point of the talk will be the workflow capabilities offered by Ophidia. In this regard, the framework stack includes an internal workflow management system, which coordinates, orchestrates, and optimises the execution of multiple scientific data analytics & visualization tasks. Specific macros are also available to implement loops, or to parallelize them in case of data independence. Real-time monitoring of workflow execution is also supported through a graphical user interface. Some real workflows implemented at CMCC and related to different EU projects will also be presented.
      Convener: Alessandro D'Anca (CMCC)
      Material: Slides pdf file Video lecture - part 1 link Video lecture - part 2 link Video lecture - part 3 link
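Ophidia's declarative operators are key=value commands, and PyOphidia wraps them for Python. The runnable sketch below assembles such a command string; the exact operator syntax and the file/measure names are assumptions for illustration, so treat this as a sketch rather than the documented interface.

```python
def ophidia_command(operator, **params):
    """Compose an Ophidia-style declarative operator string
    (key=value pairs separated by semicolons)."""
    parts = [f"operator={operator}"]
    parts += [f"{k}={v}" for k, v in sorted(params.items())]
    return ";".join(parts) + ";"

# Hypothetical NetCDF import: path and measure name are made up.
cmd = ophidia_command("oph_importnc",
                      src_path="/data/tasmax.nc",
                      measure="tasmax",
                      imp_dim="time")

# With a live Ophidia server, PyOphidia would do the equivalent along
# these lines (not executed here):
#   from PyOphidia import cube
#   cube.Cube.setclient(username="...", password="...",
#                       server="ophidia.example", port="11732")
#   mycube = cube.Cube.importnc(src_path="/data/tasmax.nc",
#                               measure="tasmax", imp_dim="time")
```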
    • 16:00 - 16:30 Coffee break
    • 16:30 - 18:00 The Kepler workflow manager
      This session examines Kepler as a tool for building scientific workflows. The emphasis is on developing the basic skills that will allow attendees to become familiar with the process of building workflows. We will cover simple tasks and present how to express typical programming constructs in a workflow-based environment. We will discuss simple workflows, composite actors, ways of switching data flow, building loops, and calling Python code directly from the workflow. This session will give students a general "feel" of the Kepler workflow management system (https://kepler-project.org).
      Organization
      This session is tutorial-based and hands-on. Students are required to take an active part during the training. All materials will be available and each task will be explained with the required level of detail. If students face issues while following the tasks, they are encouraged to raise their doubts.
      Session objectives
      1. to describe the basics of the Kepler workflow management system;
      2. to introduce students to workflow-based computations;
      3. to introduce students to the process of building workflows using Kepler;
      4. to introduce students to more complex topics: loops, Python execution.
      Convener: Michal Owsiak (PSNC)
      Material: Slides pdf file Tutorial link Video lecture link Video live demo link
  • Wednesday, 6 July 2016
    • 09:00 - 19:00 Day 3 - Presentation of use cases and their implementation strategies
      • 10:00 Error Correction of NGS Data 30'
        The error correction of NGS data is normally the first step of any application targeting NGS. Many projects in different real-life applications have opted for this step before further analysis. MuffinEC is a multi-technology (Illumina, Roche 454, Ion Torrent and PacBio - experimental), any-type-of-error handling (mismatches, deletions, insertions and unknown values) corrector. It surpasses other similar software by providing higher accuracy (demonstrated by four types of tests) and using fewer computational resources. It follows a multi-step approach that starts by grouping all the reads using a k-mer based metric. Next, it employs the powerful Smith-Waterman algorithm to refine the groups and generate Multiple Sequence Alignments (MSAs). These MSAs are corrected by taking each column and looking for the correct base, determined by a user-adjustable percentage. We plan to use Ophidia and Onedata to prepare our software for the cloud.
        Speaker: Andy S. Alic (Universitat Politecnica de Valencia - Spain)
        Material: Slides powerpoint file
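The k-mer based grouping step described above can be illustrated with a toy sketch: represent each read by its k-mer multiset and score similarity as shared k-mers. This is a didactic illustration of the general idea, not MuffinEC's actual metric, and the reads are made-up examples.

```python
from collections import Counter
from itertools import combinations

def kmers(read, k=4):
    """Multiset of overlapping k-mers in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

def shared_kmers(a, b, k=4):
    """Similarity score: number of k-mers two reads have in common
    (counting multiplicity)."""
    return sum((kmers(a, k) & kmers(b, k)).values())

# Toy reads: the first two differ by one trailing base, the third is unrelated.
reads = ["ACGTACGTAC", "ACGTACGTAG", "TTTTGGGGCC"]
pairs = {(i, j): shared_kmers(reads[i], reads[j])
         for i, j in combinations(range(len(reads)), 2)}
# Reads 0 and 1 score high and would fall into the same group; read 2
# shares no k-mer with either and would be grouped separately.
```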
      • 10:30 Algae Bloom Case Study: Managing Data From Models 30'
        Hydrodynamic and water quality modeling requires a number of parameters that are strongly correlated. Due to that number, and to the spatial and temporal resolution required by high-resolution models, the input and output files are very large. The Delft3D software suite is the tool used to perform the modeling, which includes the simulation of the physical, chemical and biological parameters of a water reservoir in Soria, Spain. This case study aims to perform the modeling of the reservoir automatically within a cloud framework. In the context of the hackfest, three different tools could be used:
        • Onedata: we need a distributed storage solution to share a common space for the input (accessible by computing) and the output generated by the model (accessible by users);
        • Ophidia: Big Data tools are very useful to analyze the large number of parameters available in the output;
        • Kepler: a workflow to automatically analyze the results could be very useful.
        Speaker: Fernando Aguilar (IFCA - Spain)
        Material: Slides pdf file
      • 11:00 Coffee break 30'
      • 11:30 Distributed Archive System for the Cherenkov Telescope Array 30'
        The Cherenkov Telescope Array (CTA) project aims to build a large array of Cherenkov telescopes of different sizes, deployed on an unprecedented scale. It will allow a significant extension of our current knowledge in high-energy astrophysics. The CTA data and their scientific products need to be preserved in a dedicated archive guaranteed to provide open access to a wide and diverse scientific community. Handling and archiving the large amount of data generated by the instruments, and delivering scientific products according to astrophysical standards, is one of the challenges in designing the CTA observatory. We present our plan to implement a distributed archive system federating storage resources using the Onedata platform (and/or other promising INDIGO-DataCloud technologies).
        Speaker: Eva Sciacca (INAF - Astrophysical Observatory of Catania - Italy)
        Material: Slides pdf file
      • 12:00 Astronomical data format integration into Ophidia 30'
        The FITS format is the standard data format for archiving images in astronomy. With this use case we aim at integrating the FITS format into the Ophidia framework, opening the path to the analysis of astronomical data within this powerful tool.
        Speaker: Elisa Londero (INAF - Astronomical Observatory of Trieste - Italy)
        Material: Slides pdf file
      • 12:30 Collaborative Knowledge Discovery Environment on Biodiversity and Linguistic Diversity 30'
        The project aim is to establish a collaborative/team-science workflow and to enable knowledge discovery as well as experimental scholarship in biodiversity and linguistic diversity.
        Our first step towards this is to establish a working environment (workspace) for researchers to explore linguistic diversity and the interconnection of languages and cultural artefacts/data in the linguistic and biological domains.
        We aim to provide users of different domains and with different backgrounds (researchers of various disciplines, laypeople) with services/applications for workflows to discover, curate and interlink biological taxonomic data with linguistic/terminological and cultural data, to enrich and connect their data to external resources, and to publish them freely accessible on the web as open data.
        The project is connected to ongoing initiatives like the COST ENeL action (European Network of e-Lexicography).
        Speakers: Eveline Wandl-Vogt (OEAW-ACDH - Austria), Ksenya Zaytseva (OEAW-ACDH - Austria), Davor Ostojic (OEAW-ACDH - Austria)
        Material: Slides powerpoint file
      • 13:00 Lunch break 1h0'
      • 14:00 Reproducible Automatic Speech Recognition workflows 30'
        The proposed use case is specific to the rich community of Human Language Technologies users in South Africa. A template for Automatic Speech Recognition will be built into a web interface and the data it uses will be stored on an Open Access Repository; the application is accessed via a Science Gateway. Users specify their parameters and data on the web interface and submit the job to the Science Gateway, which takes care of the rest. gLibrary may be used to store some of the statistical results from the experiment.
        Speaker: David Risinamhodzi (North-West University - South Africa)
        Material: Slides powerpoint file
      • 14:30 Implementation of eCulture Science Gateway - reloaded 30'
        The presentation concerns the digital library “MuseiD-Italia”, which showcases images and metadata regarding Cultural Heritage in Italy. ICCU is trying to revamp the whole workflow in order to make it better, easier and faster, as well as to add new features, potentially looking at the integration of INDIGO solutions. This is also intended as a use case to expand the experimentation in the coming months to other (and bigger) ICCU-run or ICCU-led projects.
        Speaker: Luca Martinelli (ICCU - Italy)
        Material: Slides powerpoint file
      • 15:00 Intelligent Medical Image Analyzer 30'
        The proposed system is an e-infrastructure for processing medical images so that the processed data can be used for decision support during diagnosis or clinical research. The two categories of people who will most likely use our tools are clinicians and researchers conducting medical research. The frontend of the proposed system will be built using PHP, HTML, JavaScript and XAMPP. The frontend will be user-friendly and interactive, and will allow the uploading of medical images. It will also contain options for image processing and report generation. The backend will contain MATLAB, a C++ compiler, some specialized medical image processing software packages and some test data. The proposed system will also have some storage allowance for image uploads by the user. Once images are uploaded, the software relevant for processing the particular uploaded images will be selected automatically and applied to the images. The image storage model will allow some specific types of medical images that are commonly used in the medical field.
        Speaker: Benjamin Aribisala (Lagos State University - Nigeria)
        Material: Slides powerpoint file
      • 15:30 WEKA Machine Learning in Breast Cancer 30'
        The Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository is used as a use case to classify benign and malignant samples using WEKA. The main task is to create a web interface to interact with and use the classification features of WEKA.
        Speaker: Stephan Mgaya (TERNET - Tanzania)
        Material: Slides pdf file
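The classification task above targets WEKA (a Java toolkit), but the same benign/malignant classification on the Wisconsin Breast Cancer data can be sketched in Python with scikit-learn, which ships that UCI dataset. This is an analogue of the WEKA workflow for illustration, not the use case's actual implementation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Wisconsin Breast Cancer data (569 samples, 30 features).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train a simple classifier and evaluate held-out accuracy.
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

In the hackfest scenario, a web interface would collect the user's parameters and hand a job like this to the Science Gateway for execution.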
      • 16:00 Coffee break 30'
      • 16:30 Technology Transfer Alliance Collaboration Platform 30'
        The TTA Collaboration Platform is intended to be a web-based platform containing an integrated set of tools, applications and data repositories that are accessed via a portal: the TTA Portal. The motivation for developing this platform is to support collaboration and training, and to foster education among the partners, the sharing of all sorts of resources, and the dissemination of results. The platform will allow each partner to submit content such as project proposals, project documents, news updates, information shared via content lists, and other kinds of content such as video or other multimedia material, all handled in a secure manner.
        Speaker: Diana Rwegasira (University of Dar es Salaam - Tanzania)
        Material: Slides pdf file
      • 17:00 iGrid - Smart Grid Capacity Development and Enhancement in Tanzania 30'
        The project involves designing, implementing, demonstrating, testing and validating an autonomous solar-powered LVDC nanogrid prototype, serving an off-grid community of 10-100 households, that can also be integrated into a higher-voltage AC/DC grid if needed, as part of a bigger strategy to ensure access to reliable and affordable electrical power for all communities (especially rural ones).
        Speaker: Aron Kondoro (University of Dar es Salaam - Tanzania)
        Material: Slides powerpoint file
  • Thursday, 7 July 2016
    • 09:00 - 19:00 Day 4 - Code development for use cases implementation
  • Friday, 8 July 2016
    • 09:00 - 19:00 Day 5 - Code development for use cases implementation
  • Saturday, 9 July 2016
    • 09:00 - 19:00 Day 6 - Code development for use cases implementation
  • Sunday, 10 July 2016
    • 09:00 - 19:00 Day 7 - Free day
      This day is free. Excursions will be organised.
  • Monday, 11 July 2016
    • 09:00 - 19:00 Day 8 - Code development for use cases implementation
  • Tuesday, 12 July 2016
    • 09:00 - 19:00 Day 9 - Code development for use cases implementation
  • Wednesday, 13 July 2016
    • 09:00 - 19:00 Day 10 - Code development for use cases implementation
  • Thursday, 14 July 2016
    • 09:00 - 19:00 Day 11 - Code development for use cases implementation
  • Friday, 15 July 2016
    • 09:00 - 18:00 Day 12 - Use cases final presentations and wrap-up
      • 10:00 Reproducible Automatic Speech Recognition workflows - Final report 30'
        Speaker: David Risinamhodzi (North-West University - South Africa)
        Material: Slides powerpoint file
      • 10:30 Intelligent Medical Image Analyzer - Final report 30'
        Speaker: Benjamin Aribisala (Lagos State University - Nigeria)
        Material: Slides powerpoint file
      • 11:00 Coffee break 30'
      • 11:30 WEKA Machine Learning in Breast Cancer - Final report 30'
        Speaker: Stephan Mgaya (TERNET - Tanzania)
        Material: Slides powerpoint file
      • 12:00 Technology Transfer Alliance Collaboration Platform - Final report 30'
        Speaker: Diana Rwegasira (University of Dar es Salaam - Tanzania)
        Material: Slides pdf file
      • 12:30 iGrid - Smart Grid Capacity Development and Enhancement in Tanzania - Final report 30'
        Speaker: Aron Kondoro (University of Dar es Salaam - Tanzania)
        Material: Slides powerpoint file
      • 13:00 Wrap-up and closure 30'