Belo Horizonte, Minas Gerais, Brasil
5K followers · 500+ connections


About

Data Engineer with a Software Engineer mindset, looking for clean code, with test…

Articles by Álvaro

  • Deep Learning Specialization Review

    Last week I finally finished deeplearning.ai's Deep Learning Specialization taught by Andrew Ng on Coursera. What…

    11 comments

Experience and education

  • Jungle Scout

Volunteer experience

  • AC Social

    Programming Teacher

    AC Social

    1 year 5 months

    Education

    AC Social is an initiative started at Avenue Code where people volunteer to teach children to program. I was a programming teacher for students of Bueno Brandão State School for 3 semesters.

  • Meetup de Engenharia de Dados

    Co-founder & Coordinator

    Meetup de Engenharia de Dados

    Present · 6 years 5 months

    Science and technology

    I co-founded and am a coordinator of the Data Engineering Meetup. It's a group created on the Meetup platform where we organize events focused on Data Engineering, Big Data and Distributed Computing technologies.

    Group page: www.meetup.com/engenharia-de-dados

Projects

  • Gap Demand Forecast

    Present

    In order to assist Gap Inc.'s supply chain management, there's a Demand Forecast product that provides weekly product demand forecasts over a one-year horizon. I currently work on this product's Data Engineering team, which develops, maintains, and schedules data pipelines to provide data in the desired format, and in a timely manner, for the Forecast Engine team to process.

    My team is based in the San Francisco Bay Area, where I worked onsite for my first two months on the project. I'm now working from Brazil, where we started to build a team.

    ► Main tasks:
    • Developed and maintained PySpark and Hive batch jobs to ingest data from different sources; those jobs ran on a large on-premises Hortonworks cluster
    • Established the team's Jenkins deployment pipeline structure, based on Declarative Syntax and Shared Libraries
    • Maintained job workflows on CA Workflow Automation
    • Covered PySpark applications with unit tests by mocking Hive tables and REST APIs (a sketch of the table-mocking approach follows this project's skills list)
    • Created a PoC on Azure Databricks, proving that our application could migrate there without major changes
    • Helped our Tech Lead groom stories for future sprints
    • Interviewed candidates for the team, both from Brazil and San Francisco

    ► Skills involved:
    Python, Spark, Hive, Databricks, Bash, Jenkins, CA Workflow Automation, Git, GitHub, Jira, Confluence
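
    ► Illustrative sketch (hypothetical, not the project's code): one way to mock a Hive table in a PySpark unit test is to register an in-memory DataFrame as a temporary view under the table's name; the table, columns, and job logic below are made up.

      import unittest
      from pyspark.sql import SparkSession

      def top_selling_skus(spark, limit=10):
          # Job logic under test: reads the (mocked) Hive table by name.
          return spark.sql(
              "SELECT sku, SUM(units) AS units FROM sales "
              "GROUP BY sku ORDER BY units DESC LIMIT {}".format(limit)
          )

      class TopSellingSkusTest(unittest.TestCase):
          @classmethod
          def setUpClass(cls):
              cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

          @classmethod
          def tearDownClass(cls):
              cls.spark.stop()

          def test_aggregates_units_per_sku(self):
              # Mock the Hive table: register an in-memory DataFrame under the same name.
              rows = [("sku-1", 5), ("sku-1", 3), ("sku-2", 4)]
              self.spark.createDataFrame(rows, ["sku", "units"]).createOrReplaceTempView("sales")

              result = {r["sku"]: r["units"] for r in top_selling_skus(self.spark).collect()}
              self.assertEqual(result, {"sku-1": 8, "sku-2": 4})

      if __name__ == "__main__":
          unittest.main()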

  • G Migration Data Pipeline

    -

    ► Description:
    I created an ETL application using Apache Beam with the Java SDK to migrate 15 TB of invoice data from the last five years from an on-premises SQL Server database to another SQL Server on Google Cloud. The job also had to migrate a blob XML column to Google Cloud Storage, to save space and improve query performance. It ran on top of Google Cloud DataFlow, which scaled to more than 300 worker nodes and made it possible to migrate and audit more than 40 million rows per hour.

    ► Main tasks:
    • Developed an ETL application with Apache Beam/DataFlow to migrate tables from an on-premises SQL Server to one on Google Cloud; it also moved a blob XML column to Google Cloud Storage
    • Developed an audit application with Apache Beam/DataFlow to compare the original and migrated tables, as well as the original blob column with the documents migrated to Cloud Storage (a sketch of the comparison idea follows this project's skills list)
    • Created audit result tables showing aggregated results as well as detailed errors, which made it possible for the client to discover bugs in their own application that they weren't aware of
    • Reduced the original table size by 15x, enabling faster queries, making indexing feasible again, and lowering operational costs
    • Deployed the application using Jenkins

    ► Skills involved:
    Java, Apache Beam, DataFlow, Google Cloud, Jenkins, Git
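
    ► Illustrative sketch (hypothetical): the row-by-row audit can be pictured as a join of source and migrated records on their primary key. This sketch uses Beam's Python SDK for brevity (the project used the Java SDK) and in-memory data instead of SQL Server reads; column names are made up.

      import apache_beam as beam

      def key_by_id(row):
          # Key each record by its primary key so both sides can be grouped together.
          return (row["invoice_id"], row)

      def flag_mismatch(element):
          invoice_id, grouped = element
          source, target = list(grouped["source"]), list(grouped["target"])
          if source != target:
              yield {"invoice_id": invoice_id, "source": source, "target": target}

      with beam.Pipeline() as p:
          # In the real job both sides were read from SQL Server; Create() stands in here.
          source = p | "ReadSource" >> beam.Create([{"invoice_id": 1, "total": 10.0}])
          target = p | "ReadTarget" >> beam.Create([{"invoice_id": 1, "total": 12.0}])

          ({"source": source | "KeySource" >> beam.Map(key_by_id),
            "target": target | "KeyTarget" >> beam.Map(key_by_id)}
           | "JoinById" >> beam.CoGroupByKey()
           | "FlagMismatches" >> beam.FlatMap(flag_mismatch)
           | "Print" >> beam.Map(print))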

  • Avenue Code Data Science & Engineering PoCs

    -

    ► Description:
    Worked on a few internal Data Science and Engineering initiatives at Avenue Code, as well as on proofs of concept for potential clients.

    ► Main tasks:
    • Started the development of AC Forecast, a generic forecasting application based on Google Cloud Platform products; data was ingested on Cloud Storage, processed on Dataflow, and the model was trained on AI Platform
    • Created a dashboard for Vallourec using Plotly's Dash to help their Workplace Safety team analyze incidents and accidents in a centralized manner; also helped cluster those incidents and accidents into semantic topics using BERT and K-Means (a sketch of the clustering idea follows this project's skills list)
    • Created an integration between AC Insight (a sentiment classification system) and Avenue Code's website: a dashboard built with Chart.js, vanilla JavaScript, HTML, and CSS, backed by a Node.js server that polled a MongoDB collection for new sentiments every 5 seconds
    • Proposed and created a real-time dashboard architecture for AC Insight (mentioned above) using Node.js, Socket.IO, Kafka, and Chart.js

    ► Skills involved:
    Python, Keras, scikit-learn, Google Cloud, Kafka, Docker, MongoDB, JavaScript, HTML, CSS, Plotly, Dash
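
    ► Illustrative sketch (hypothetical): the incident-clustering idea from the Vallourec dashboard, assuming the sentence-transformers package as one way to obtain BERT embeddings; the incident texts, model name, and cluster count are made up.

      from sentence_transformers import SentenceTransformer
      from sklearn.cluster import KMeans

      incidents = [
          "Worker slipped on wet floor near the furnace",
          "Forklift collision in warehouse aisle 3",
          "Minor burn while handling a hot pipe",
      ]

      # Encode each incident description into a BERT sentence embedding.
      model = SentenceTransformer("all-MiniLM-L6-v2")
      embeddings = model.encode(incidents)

      # Group semantically similar incidents into topics with K-Means.
      labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)

      for text, label in zip(incidents, labels):
          print(label, text)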

  • KPI and Network Parameters Correlation Data Pipeline

    -

    ► Description:
    A module that finds correlations between good levels of mobile network Key Performance Indicators (KPIs, such as Call Drop rate and Traffic Volume) and specific combinations of network parameters. It was modeled as a classifier that took thousands of parameter columns as inputs (indexed by network cell/antenna ID) and output the KPI level (good or bad performance). The parameter combinations correlated with good and bad KPIs were then inferred from the feature importances of the model (Gradient Boosting Trees).

    Due to the large number of tables that needed to be joined, as well as the number of resulting columns and records, it was a Big Data problem that required careful treatment. To deal with it, I created a data engineering pipeline based on Apache Spark.
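
    ► Illustrative sketch (hypothetical): the modeling idea above, shown with scikit-learn's gradient boosting on synthetic data; the real module ran on Spark with thousands of parameter columns.

      import numpy as np
      from sklearn.ensemble import GradientBoostingClassifier

      rng = np.random.default_rng(0)
      n_cells = 1000

      # Synthetic network parameters indexed by cell ID (thousands of columns in the real case).
      params = rng.integers(0, 4, size=(n_cells, 20))
      # Synthetic KPI label (good = 1, bad = 0), driven here by parameters 0 and 3 only.
      kpi_good = ((params[:, 0] == 1) & (params[:, 3] > 1)).astype(int)

      model = GradientBoostingClassifier().fit(params, kpi_good)

      # The parameter columns most correlated with KPI levels are read off the importances.
      top = np.argsort(model.feature_importances_)[::-1][:5]
      print("most influential parameter columns:", top)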

    ► Main tasks:
    • Responsible for researching several data ingestion frameworks, which led to the choice of SparkSQL.
    • Designed and implemented an ingestion feature that concurrently reads the source tables every day using Spark and persists them as Parquet files.
    • Designed and implemented a column-filtering feature that concurrently dropped unnecessary network parameter columns based on their distributions.
    • Designed and implemented a feature that generates intermediate tables indexed with the same keys as the resulting dataset; it was implemented with Spark and ran concurrently.
    • Designed and implemented a feature that joins all those tables to create the final dataset; it was implemented with Spark and ran concurrently.
    • Responsible for integrating all the previous features into a single module, both vendor and technology agnostic.
    • Responsible for creating interactive reports on the module's performance with Plotly, using data parsed from the module's logs.

    ► Skills involved:
    Python, Spark, MySQL, Git, Jupyter Notebook, PySpark, Plotly

  • KPI Forecasting

    -

    ► Description:
    Study on different time series forecasting methods applied to a large mobile phone operator's KPIs. Based on real traffic data from the operator, forecasting methods such as the ARIMA model, Exponential Smoothing methods, and Neural Network models were applied. The results were compared, and the neural-network-based models obtained the best results.
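
    ► Illustrative sketch (hypothetical): one of the Exponential Smoothing methods mentioned above (Holt-Winters), trained on all but the last week of a synthetic hourly series and scored with RMSE on the held-out week, mirroring the setup described in the tasks below.

      import numpy as np
      from statsmodels.tsa.holtwinters import ExponentialSmoothing

      # Synthetic hourly traffic series with a daily (24 h) seasonal pattern: 8 weeks of data.
      hours = np.arange(24 * 7 * 8)
      series = 100 + 20 * np.sin(2 * np.pi * hours / 24)
      series = series + np.random.default_rng(1).normal(0, 2, hours.size)

      # Train on the first seven weeks, hold out the last week for evaluation.
      split = 24 * 7 * 7
      train, test = series[:split], series[split:]

      model = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=24).fit()

      # Multi-step-ahead forecast over the held-out week, scored with RMSE.
      forecast = model.forecast(test.size)
      print("RMSE:", np.sqrt(np.mean((forecast - test) ** 2)))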

    ► Main tasks:
    • Obtained eight weeks of hourly sampled data for 240 time series of traffic volume transferred between users' mobile phones and a Brazilian mobile network company's base transceiver stations.
    • Interpolated each series' missing values. For the ARIMA model, a Box-Cox transformation was applied to stabilize the series' variance.
    • For each of the 240 time series, the following models were trained on the first seven weeks of data: Seasonal Naïve (the benchmark to beat), three Exponential Smoothing models (Holt-Winters, Double Seasonal Holt-Winters, and TBATS), the ARIMA model, and three Neural Network architectures (Feedforward, Recurrent, and Long Short-Term Memory).
    • For each model, two types of forecasts were produced on the last week of data to evaluate it (using the RMSE metric): one-step-ahead and multi-step-ahead forecasts.

    ► Skills involved:
    Python, R Language, Machine Learning, Pandas, Keras, Jupyter Notebook, Git, GitHub

  • KPI Classifier

    -

    ► Description:
    Bwtech developed an anomaly detection module that would run different anomaly detection algorithms for different types of KPI time series.

    There were three main types of KPIs, and we needed a way to identify the type automatically: although some KPIs were known, users could create new KPIs whose type the system wouldn't know. To deal with that, I developed a KPI classifier.

    ► Main tasks:
    • Developed a utility using the MySQL Python connector where the user specifies a query and gets back a Pandas DataFrame with the result.
    • Used that utility class to collect two months of daily records for thousands of KPI time series, with an even number of series for each of the three KPI types, in order to have a balanced dataset.
    • Windowed the time series to create multiple smaller series from each original one, in order to have a larger dataset.
    • From each windowed series, extracted a large set of features using the TSFRESH Python package; these features were the model's input and the KPI type was its output (a sketch of this approach follows this project's skills list).
    • Trained a Random Forest model that achieved 98% accuracy on the test set.

    ► Skills involved:
    Python, Machine Learning, MySQL, Git, Jupyter Notebook, Pandas
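
    ► Illustrative sketch (hypothetical, referenced in the tasks above): windowed series in TSFRESH's long format, feature extraction, and a Random Forest over the extracted features; the KPI data here is synthetic.

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from tsfresh import extract_features
      from tsfresh.utilities.dataframe_functions import impute

      rng = np.random.default_rng(0)

      # Synthetic KPI windows: 30 series of 60 daily points each, in TSFRESH's long format.
      frames, labels = [], {}
      for series_id in range(30):
          kpi_type = series_id % 3  # three KPI types, kept balanced
          values = rng.normal(loc=kpi_type * 5, scale=1.0, size=60)
          frames.append(pd.DataFrame({"id": series_id, "time": range(60), "value": values}))
          labels[series_id] = kpi_type

      long_df = pd.concat(frames, ignore_index=True)

      # Extract a large set of time series features per window and impute NaN/inf values.
      features = impute(extract_features(long_df, column_id="id", column_sort="time"))
      y = pd.Series(labels).loc[features.index]

      # Classify the KPI type from the extracted features.
      clf = RandomForestClassifier(random_state=0).fit(features, y)
      print("training accuracy:", clf.score(features, y))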

  • Mine Manager Portal

    -

    ► Description:
    A web-based copper production management system for an Australian mining company. The entire mining process was mapped: planning, extraction, storage, crushing, grinding, among other steps.

    ► Main tasks:
    • Responsible for defining the frontend architecture based on Vue.js, Vue Router, and Element UI.
    • Responsible for defining the frontend style guide.
    • Responsible for implementing the interfaces exactly as specified by the client.
    • Proposed the use of Git and a GitLab server, and trained both the team and the entire company to use them.
    • Responsible for many backend tasks; the backend followed a Domain-Driven Design architecture based on the .NET framework.

    ► Skills involved:
    Vue.js, .NET WEB API, Entity Framework, SQL Server, Git, GitLab

  • Waysides Web

    -

    ► Description:
    A web-based system to monitor the train fleet of one of the world's biggest mining companies, based on railroad sensors. The system was already in production in an older, non-performant version whose frontend was an Excel spreadsheet that remotely ran database procedures directly.

    ► Main tasks:
    • Responsible for migrating the Excel sheets' functionality to a web application using AngularJS and Bulma, designing and developing reusable web components.
    • Migrated most of the business logic to a .NET application.

    ► Skills involved:
    C#, .NET WEB API, HTML, CSS, JavaScript, AngularJS, SVN, Bulma

  • Radix Data Science PoCs

    -

    ► Description:
    In addition to Radix's first Machine Learning project, we carried out some studies as initiatives to attract Data Science projects.

    ► Main tasks:
    • An analysis of the possible causes of fouling (i.e. accumulation of unwanted material) on ExxonMobil's heat exchangers. The study was based on the feature importances of an XGBoost regression model whose response variable was the calculated fouling factor and whose inputs were the plant's sensor measurements.
    • Another study was on proactive predictive maintenance for Vale's train fleet, based on wheel brake thickness measurements made by sensors on the railroads. Since the wheels wore out in a linear pattern, a regression model was fit to predict future thickness values; a system based on that model could notify operators when predicted thicknesses crossed a certain threshold (a toy sketch of this idea follows this project's skills list).

    ► Skills involved:
    Python, Machine Learning, SVN, Jupyter Notebook, XGBoost, scikit-learn, Pandas, Matplotlib, Seaborn
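
    ► Illustrative sketch (hypothetical, referenced in the tasks above): fitting a linear trend to past thickness measurements and estimating when it crosses a maintenance threshold; all numbers are made up.

      import numpy as np

      # Past wheel brake thickness measurements (mm) taken by wayside sensors.
      days = np.array([0, 15, 30, 45, 60, 75, 90, 105, 120, 135])
      thickness = np.array([60.0, 59.2, 58.5, 57.7, 57.0, 56.1, 55.4, 54.6, 53.9, 53.0])

      # Wear is roughly linear, so a first-degree fit extrapolates future thickness.
      slope, intercept = np.polyfit(days, thickness, 1)

      threshold = 50.0  # hypothetical minimum safe thickness
      days_until_threshold = (threshold - intercept) / slope

      print("predicted day of reaching the threshold:", round(days_until_threshold))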

  • Oil Expert

    -

    ► Description:
    System developed for a Caterpillar dealer to perform predictive maintenance based on fluid analysis. It reported both an equipment oil sample's criticality level (should it go to maintenance or not?) and a diagnostic text (e.g. "The debris concentration is normal. There is a high level of iron. Check again."). Since Radix wanted to sell it to other Caterpillar dealers, there were more tasks to be done.

    ► Main tasks:
    • Responsible for training models with other dealers' datasets, which achieved high accuracy (as the original system did).
    • Responsible for writing a paper (as a co-author) about the system, which was presented at ABM Week 2017.
    • Responsible for creating the system's documentation.

    ► Skills involved:
    Python, Machine Learning, Docker, AWS, SVN, Jupyter Notebook, scikit-learn, Pandas, XGBoost, Matplotlib, Seaborn

  • Marketing Enablement Portal

    -

    ► Description:
    Development of a portal for the marketing team of one of the world's biggest companies, used to create multiple campaigns and upload thousands of leads to Salesforce through a user-friendly interface following the Material Design specifications.

    ► Main tasks:
    • Implemented UI tasks specified by the designer using Angular Material.
    • Responsible for the Lead tasks: a feature where users would upload thousands of leads from spreadsheets via Google Cloud Endpoints; the leads were parsed with the Google Sheets API and persisted on Google Cloud Datastore. After that, App Engine tasks would be queued on Task Queues in order to transfer those leads to Salesforce using their Bulk API.
    • Responsible for the Campaign Creation tasks: a feature where users of the portal would create multiple marketing campaigns on Salesforce.
    • Responsible for parsing project setup configurations on Google Sheets with Google Apps Script, which was also used to make API calls to populate Datastore.
    • Responsible for creating unit tests using JUnit.
    • Always made sure my tasks didn't introduce new bugs or code smells on SonarQube, which I would also consult to keep our codebase cleaner.

    ► Skills involved:
    Java, Google Cloud, Google App Engine, Salesforce, HTML, CSS, JavaScript, AngularJS, Git, Jira, SonarQube, Maven, Eclipse, Jenkins

Languages

  • Portuguese

    Native or bilingual proficiency

  • English

    Native or bilingual proficiency

  • Spanish

    Limited working proficiency

Recommendations received

5 people have recommended Álvaro
