• Data Quality Assessment

    From Mike Dwell@21:1/5 to All on Fri Jan 28 21:37:14 2022
    Hi there,

    I have been banging my head against the wall trying to figure out a good real-world solution to a challenging problem a friend asked me about.

    Could you please give some pointers?

    Let's say we want to assess the data quality of Company A's big data. Due to security, privacy, and workload concerns, it's impossible to view or access A's whole data repository (data lake or data ocean).

    We can only request a sample of Company A's big data and then, hopefully, apply some quality-assessment toolkit to analyze it.

    My question is: how should we draw such a data sample, and what requirements should we set for it?
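    To make the question concrete, here is the kind of draw I have in mind (just a Python sketch; the strata and names are invented), where the sample is spread over every partition rather than taken wherever is convenient:

    import random

    # Minimal sketch: draw the same number of records from every stratum
    # (e.g. table, region, or month) so no partition goes unexamined.
    def stratified_sample(records_by_stratum, per_stratum, seed):
        """records_by_stratum: dict mapping stratum name -> list of records."""
        rng = random.Random(seed)  # fixed seed makes the draw reproducible
        sample = {}
        for stratum, records in records_by_stratum.items():
            k = min(per_stratum, len(records))
            sample[stratum] = rng.sample(records, k)
        return sample

    # e.g. 100 records from each month's partition:
    # sample = stratified_sample(records_by_month, per_stratum=100, seed=42)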

    Moreover, Company A may "optimize" or "decorate" the sample data it hands over. What scheme or mechanism design might prevent such "optimization" or "decoration"?
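    One idea I had, just a sketch assuming every record has a stable ID: make A commit to the full dataset first (say, by publishing a hash of the sorted record-ID list, or a Merkle root), and only then reveal a secret sampling key. A record is in the sample exactly when a keyed hash of its ID falls below a threshold, so A cannot predict which records will be drawn and cannot polish just those in advance:

    import hmac, hashlib

    def in_sample(record_id: str, key: bytes, rate: float) -> bool:
        # HMAC keeps the selection unpredictable until the key is revealed.
        digest = hmac.new(key, record_id.encode(), hashlib.sha256).digest()
        # Interpret the first 8 bytes as an integer in [0, 2**64).
        value = int.from_bytes(digest[:8], "big")
        return value < rate * 2**64

    # e.g. a 0.1% sample, once the key is revealed:
    # picked = [r for r in all_record_ids if in_sample(r, key, 0.001)]

    Does something along these lines already exist in the auditing literature?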

    Could anybody please give some pointers?

    Thanks a lot!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Duffy@21:1/5 to Mike Dwell on Sat Jan 29 09:11:42 2022
    Mike Dwell <mike.dwellsiegmund@gmail.com> wrote:

    > Let's say we want to assess the data quality of Company A's big data.
    > Due to security, privacy, and workload concerns, it's impossible to
    > view or access A's whole data repository (data lake or data ocean).
    >
    > We can only request a sample of Company A's big data and then,
    > hopefully, apply some quality-assessment toolkit to analyze it.
    >
    > My question is: how should we draw such a data sample, and what
    > requirements should we set for it?
    >
    > Moreover, Company A may "optimize" or "decorate" the sample data it
    > hands over. What scheme or mechanism design might prevent such
    > "optimization" or "decoration"?

    Hi. I don't think there is a generic quality-assessment approach, as
    so much depends on the nature of the work the company does, aside
    from obvious things like range checks on dates etc. (which they
    usually do on the fly anyway) or batch-to-batch comparisons. For
    example, can you validate data points against external data sources?
    If it really is "big data", they can easily provide a far larger
    sample than you will ever need for statistical-power purposes.
    Personally, I would first do an "eyeball" examination of a relatively
    small contiguous data series, just to get a feel for what goes on.
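    As a rough illustration of the range checks and batch-to-batch comparisons I mean (pandas, with invented column names, and assuming the dates are already parsed as datetimes):

    import pandas as pd

    def basic_checks(df: pd.DataFrame) -> dict:
        return {
            # range check: dates should fall in a plausible window
            "bad_dates": int(((df["event_date"] < "2000-01-01")
                              | (df["event_date"] > pd.Timestamp.today())).sum()),
            # completeness: fraction of missing values per column
            "null_rate": df.isna().mean().to_dict(),
            # exact duplicate rows
            "dup_rows": int(df.duplicated().sum()),
            # batch-to-batch comparison: does the mean drift across batches?
            "amount_by_batch": df.groupby("batch_id")["amount"].mean().to_dict(),
        }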

    Just 2c, David Duffy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)