DataHub - Synthetic data generation

Description

Business Problem

A lack of useful test data hampers the development of new systems. Cloning production data (where it exists) creates significant risks that require ongoing management: PII must be redacted, and compliance and info-sec teams must provide continued oversight.

Sharing data with potential partners is problematic, and data redaction is not always the ideal solution because the distributions in the data remain exposed.

Gaining approval to use confidential data sets with external cloud services can be a lengthy process. Teams are often unable to run rapid POCs with external cloud services without lengthy internal engagements with info-security and compliance teams.

Proposed Solution

DataHub is a Python-based library that helps developers generate synthetic data. It relies on familiar Python libraries such as pandas, numpy, scipy and nltk, which are already part of most data-science toolkits. DataHub is split into two primary feature areas. The first analyses the distributions within an input data set and produces a statistical model. The second is synthetic data production, which can use the generated statistical models and blend in hand-crafted attributes. DataHub can equally be used to build data "from scratch".

Note: DataHub is not a standalone tool but a set of Python libraries.
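
The two feature areas can be illustrated with a minimal sketch. This is not DataHub's API: `fit_model`, `generate` and the dictionary layout are hypothetical names invented for this example, and the sketch uses only plain pandas and numpy. It assumes numeric columns can be summarised by a normal distribution and that columns are independent, so correlations between columns are not preserved.

```python
import numpy as np
import pandas as pd


def fit_model(df: pd.DataFrame) -> dict:
    """Hypothetical helper: summarise each column as a simple statistical model."""
    model = {}
    for name in df.columns:
        col = df[name].dropna()
        if pd.api.types.is_numeric_dtype(col):
            # Assumption: a normal distribution is an adequate summary.
            model[name] = {
                "kind": "normal",
                "mean": float(col.mean()),
                "std": float(col.std(ddof=0)),
            }
        else:
            # Categorical column: keep the observed relative frequencies.
            freq = col.value_counts(normalize=True)
            model[name] = {
                "kind": "categorical",
                "values": freq.index.tolist(),
                "probs": freq.to_numpy().tolist(),
            }
    return model


def generate(model: dict, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: sample each column independently from the model."""
    rng = np.random.default_rng(seed)
    data = {}
    for name, spec in model.items():
        if spec["kind"] == "normal":
            data[name] = rng.normal(spec["mean"], spec["std"], size=n_rows)
        else:
            data[name] = rng.choice(spec["values"], size=n_rows, p=spec["probs"])
    return pd.DataFrame(data)


# Step 1: analyse an input data set (made-up illustrative rows).
source = pd.DataFrame(
    {
        "notional": [100.0, 250.0, 175.0, 300.0],
        "currency": ["USD", "EUR", "USD", "GBP"],
    }
)
model = fit_model(source)

# Blend in a hand-crafted attribute that is absent from the source data.
model["desk"] = {"kind": "categorical", "values": ["FX", "Rates"], "probs": [0.5, 0.5]}

# Step 2: produce synthetic rows from the model.
synthetic = generate(model, n_rows=1000)
```

Because the model is a plain dictionary, building data "from scratch" in this sketch amounts to writing every entry by hand and skipping `fit_model` entirely.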

Current State

The library is complete and is now being expanded to cover various use cases within Citi, such as trade generation, market risk, accounts and account onboarding.

Existing Materials

Not public yet

Development Team

Paul Groves - paul.timothy.groves@citi.com


Activity

Maurizio Pillitu 
June 15, 2020 at 4:03 PM

Announcement was sent out. Onboarding completed!

Maurizio Pillitu 
May 15, 2020 at 8:01 AM

https://github.com/finos/datahub is transferred and public.

The next step is to send an announcement to announce@finos.org; we've put together a simple template you can follow if you'd like.

Thanks!

Maurizio Pillitu 
March 30, 2020 at 9:16 PM

Status update: we've iterated on legal and security validation and found a few small items to address, as commented on https://finosfoundation.atlassian.net/browse/CONTRIB-63?focusedCommentId=14475&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14475 . The team is working to address them so that we can move forward with the contribution process.

Colin Eberhardt (He/Him) 
October 7, 2019 at 10:21 AM

Not a prereq to accept, but is there pent-up interest in the community?

I think I can answer that one - yes, there certainly is an interest in synthetic data generation, which is why we created and contributed DataHelix:

As Andrew mentions, there are some obvious synergies between the two projects, and of course some overlap too. Definitely interested to see how we can work together.

 

Done

Details

Created October 2, 2019 at 8:47 AM
Updated June 15, 2020 at 4:03 PM
Resolved June 15, 2020 at 4:03 PM