Data science pipelines from head to toes: a formal & executable tool for all. Status, problems, challenges, and open issues

Talks and Seminars

20/09/2023

14:30

Fernão Stella de Rodrigues Germano Auditorium - ICMC

Speaker: Genoveva Vargas Solar

Organizer: Cristina Aguiar (cdac@icmc.usp.br)

Format: In person

Vast collections of heterogeneous data have become the backbone of scientific, analytic, and forecasting processes. By combining simulation techniques, computer vision, and machine learning with data science techniques, it is possible to compute mathematical models that help understand and predict phenomena. To achieve this ambitious objective, data must go through complex and repetitive processing and analysis pipelines, namely data science pipelines.

A data science pipeline is a set of processes that convert raw data into actionable answers to research/business questions, providing insights and solutions to [experimental sciences] problems and enabling data-driven decisions. The objective is to automate the process of extracting data from multiple sources, cleaning and transforming it, analyzing it, and presenting the results in an understandable format. Data science pipelines can include machine learning, statistical and numerical models, and data visualization and interpretation tools. Data scientists use pipelines to automate the repetitive tasks that lead from raw data to [scientific]/business insights, to make results reproducible, and to share workflows with other communities.
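The stages named above (extract, clean, transform, analyze, present) can be illustrated with a minimal sketch. It is not taken from the talk: the sample readings, the function names, and the `run_pipeline` helper are all hypothetical, and a real pipeline would normally be enacted by a workflow framework rather than a plain loop.

```python
from statistics import mean

# Hypothetical raw readings from two sources; None marks a missing value.
RAW_SOURCES = {
    "sensor_a": [21.5, None, 22.1, 23.0],
    "sensor_b": [19.8, 20.4, None, 20.9],
}


def extract(sources):
    """Gather (source, value) records from every source."""
    return [(name, v) for name, values in sources.items() for v in values]


def clean(records):
    """Drop records whose value is missing."""
    return [(name, v) for name, v in records if v is not None]


def transform(records):
    """Group the remaining values by source."""
    grouped = {}
    for name, v in records:
        grouped.setdefault(name, []).append(v)
    return grouped


def analyze(grouped):
    """Compute the mean value per source, rounded to two decimals."""
    return {name: round(mean(vs), 2) for name, vs in grouped.items()}


def present(summary):
    """Render the analysis as human-readable text."""
    return "\n".join(
        f"{name}: mean = {value}" for name, value in sorted(summary.items())
    )


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data


STEPS = [extract, clean, transform, analyze, present]

if __name__ == "__main__":
    print(run_pipeline(RAW_SOURCES, STEPS))
```

Because each stage is a plain function and the pipeline is just an ordered list, rerunning `run_pipeline` on the same input gives the same result, which is the reproducibility property the paragraph refers to.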

Various frameworks are available for enacting data science pipelines, depending on the project's specific needs. Some popular frameworks include Apache Airflow, Prefect, Kubeflow, and MLflow. The enactment of data science pipelines must balance the delivery of different types of services, such as (i) hardware (computing, storage, and memory), (ii) communication (bandwidth and reliability) and scheduling, and (iii) resource-greedy analytics and mining with high memory and compute-cycle requirements.

This talk introduces critical challenges and current results regarding the development of data science pipelines. It emphasizes efficient enactment strategies for exploring experimental sciences problems that go beyond currently available analytics scales and for performing continuous, online, data-centric science experiments.

Speaker:

Genoveva Vargas Solar (http://www.vargas-solar.com) is a principal researcher at the French National Centre for Scientific Research (CNRS). She is a member of the Database group of the Laboratory of Informatics on Image and Information Systems (LIRIS). She is a regular member of the Mexican Academy of Computing (AMEXCOMP). Her education includes two PhDs and two master's degrees, in Computer Science and in Comparative Literature (mythocriticism and mythanalysis), from the University of Grenoble, and several certificates in feminist and gender studies from the National Autonomous University of Mexico (UNAM).

Genoveva Vargas-Solar is a gender equity officer of the Gender Equity Commission at the LIRIS lab. She represents the EDBT Endowment (a major European conference in databases) in the D&I database interconference initiative. She is a member of the Tierra Común activist group and participates in the European project Gender STI as part of the CNRS partner group.

She contributes to the construction of service-based database/data science management systems. The objective is to design data science workflows, new queries, and enactment services guided by Service Level Objectives (SLOs). Her work mainly addresses data science queries exploiting graphs. She proposes query evaluation methodologies, algorithms, and tools for composing, deploying, and executing data science functions on just-in-time architectures (disaggregated data centres). She conducts fundamental and applied research to address these challenges on different architectures: ARM, Raspberry Pi, cluster, cloud, and HPC.
