
DataZoo

FederatedScope provides a rich collection of federated datasets for researchers, covering images, text, graphs, recommendation systems, and speech, as well as the utility class BaseDataTranslator for building your own FS datasets.

Built-in FS data

All datasets can be accessed through federatedscope.core.auxiliaries.data_builder.get_data, which returns them as a federatedscope.core.data.StandaloneDataDict (for more details, see [DataZoo Advanced]). When you set cfg.data.type = DATASET_NAME, FS downloads and pre-processes the specified dataset, which can then be passed to FedRunner. For example:

# Source: federatedscope/main.py

data, cfg = get_data(cfg)
runner = FedRunner(data=data,
                   server_class=get_server_cls(cfg),
                   client_class=get_client_cls(cfg),
                   config=cfg.clone())

We provide a look-up table for you to get started with our DataZoo:

| cfg.data.type | Domain |
| --- | --- |
| FEMNIST | CV |
| Celeba | CV |
| {DNAME}@torchvision | CV |
| Shakespeare | NLP |
| SubReddit | NLP |
| Twitter (Sentiment140) | NLP |
| {DNAME}@torchtext | NLP |
| {DNAME}@huggingface_datasets | NLP |
| Cora | Graph (node-level) |
| CiteSeer | Graph (node-level) |
| PubMed | Graph (node-level) |
| DBLP_conf | Graph (node-level) |
| DBLP_org | Graph (node-level) |
| csbm | Graph (node-level) |
| Epinions | Graph (link-level) |
| Ciao | Graph (link-level) |
| FB15k | Graph (link-level) |
| FB15k-237 | Graph (link-level) |
| WN18 | Graph (link-level) |
| MUTAG | Graph (graph-level) |
| BZR | Graph (graph-level) |
| COX2 | Graph (graph-level) |
| DHFR | Graph (graph-level) |
| PTC_MR | Graph (graph-level) |
| AIDS | Graph (graph-level) |
| NCI1 | Graph (graph-level) |
| ENZYMES | Graph (graph-level) |
| DD | Graph (graph-level) |
| PROTEINS | Graph (graph-level) |
| COLLAB | Graph (graph-level) |
| IMDB-BINARY | Graph (graph-level) |
| IMDB-MULTI | Graph (graph-level) |
| REDDIT-BINARY | Graph (graph-level) |
| HIV | Graph (graph-level) |
| ESOL | Graph (graph-level) |
| FREESOLV | Graph (graph-level) |
| LIPO | Graph (graph-level) |
| PCBA | Graph (graph-level) |
| MUV | Graph (graph-level) |
| BACE | Graph (graph-level) |
| BBBP | Graph (graph-level) |
| TOX21 | Graph (graph-level) |
| TOXCAST | Graph (graph-level) |
| SIDER | Graph (graph-level) |
| CLINTOX | Graph (graph-level) |
| graph_multi_domain_mol | Graph (graph-level) |
| graph_multi_domain_small | Graph (graph-level) |
| graph_multi_domain_biochem | Graph (graph-level) |
| cikmcup | Graph (graph-level) |
| toy | Tabular |
| synthetic | Tabular |
| quadratic | Tabular |
| {DNAME}@openml | Tabular |
| vertical_fl_data | Tabular (vertical) |
| VFLMovieLens1M | Recommendation |
| VFLMovieLens10M | Recommendation |
| HFLMovieLens1M | Recommendation |
| HFLMovieLens10M | Recommendation |
| VFLNetflix | Recommendation |
| HFLNetflix | Recommendation |
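
Several entries in the table follow a {DNAME}@source pattern, where the part after the @ names the external package that supplies the dataset. As a rough illustration of that naming convention only, the sketch below splits such a string into its two parts. The helper name parse_data_type is hypothetical and is not part of FederatedScope; how FS actually dispatches on cfg.data.type is not shown here.

```python
def parse_data_type(data_type):
    """Split a cfg.data.type string into (dataset_name, source).

    Hypothetical helper for illustration; not a FederatedScope API.
    `source` is None for built-in names that carry no '@' suffix.
    """
    name, sep, source = data_type.rpartition("@")
    if not sep:
        # No '@' present: rpartition puts the whole string in `source`.
        return data_type, None
    return name, source


external = parse_data_type("CIFAR10@torchvision")
builtin = parse_data_type("FEMNIST")
```

Under this reading, a built-in name such as FEMNIST has no source suffix, while CIFAR10@torchvision would ask for the CIFAR10 dataset from torchvision.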

DataZoo Advanced

In this section, we will introduce key concepts and tools to help you understand how FS data works and how to use it to build your own data in FS.

Concepts:

  • federatedscope.core.data.ClientData

    • ClientData is a subclass of dict. In federated learning, each participant (a client or the server) owns a ClientData for training, validation, or testing. Each ClientData therefore has one or more of train, val, and test as keys, each mapped to the corresponding DataLoader.

    • The DataLoader for each key is created by the setup() method, which takes DataLoader arguments such as batch_size and shuffle from cfg.

      Example:

      # Instantiate client_data for each Client
      client_data = ClientData(DataLoader, 
                               cfg, 
                               train=train_data, 
                               val=None, 
                               test=test_data)
      # other_cfg with different batch size
      client_data.setup(other_cfg)
      print(client_data)
      
      >> {'train': DataLoader(train_data), 'test': DataLoader(test_data)}
      
  • federatedscope.core.data.StandaloneDataDict

    • StandaloneDataDict is a subclass of dict. As the name implies, StandaloneDataDict consists of all ClientData objects in standalone mode, keyed by client index (0, 1, 2, ...). Key 0 holds the server's data, used for global evaluation or other purposes.
    • The preprocess() method in StandaloneDataDict updates the inner ClientData when cfg changes. For example, in global mode, where cfg.federate.method = "global", StandaloneDataDict merges all ClientData into a single client to perform global training.
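
To make the two concepts above more concrete, here is a minimal, dependency-free sketch of the same ideas: a dict subclass whose setup() rebuilds one loader per split, and a merge step that collapses all clients into one for global training. ToyLoader, ToyClientData, and merge_clients are hypothetical stand-ins written for this illustration; they are not the FederatedScope classes, and placing the merged client under key 1 while keeping key 0 for the server is an assumption of the sketch.

```python
class ToyLoader:
    """Hypothetical stand-in for a DataLoader: yields fixed-size batches."""

    def __init__(self, data, batch_size):
        self.data = list(data)
        self.batch_size = batch_size

    def __iter__(self):
        for i in range(0, len(self.data), self.batch_size):
            yield self.data[i:i + self.batch_size]


class ToyClientData(dict):
    """Dict keyed by split name; setup() (re)builds one loader per split."""

    def __init__(self, loader_cls, batch_size, **splits):
        super().__init__()
        self.loader_cls = loader_cls
        # Splits passed as None (e.g. val=None) are simply left out.
        self.raw = {k: list(v) for k, v in splits.items() if v is not None}
        self.setup(batch_size)

    def setup(self, batch_size):
        for name, data in self.raw.items():
            self[name] = self.loader_cls(data, batch_size)


def merge_clients(standalone, batch_size):
    """Merge clients 1..N into one client (assumed key 1); keep server key 0."""
    merged = {}
    for cid in sorted(standalone):
        if cid == 0:
            continue
        for name, data in standalone[cid].raw.items():
            merged.setdefault(name, []).extend(data)
    result = {1: ToyClientData(ToyLoader, batch_size, **merged)}
    if 0 in standalone:
        result[0] = standalone[0]
    return result


standalone = {
    0: ToyClientData(ToyLoader, 2, test=[9, 10]),
    1: ToyClientData(ToyLoader, 2, train=[1, 2, 3], val=None, test=[4]),
    2: ToyClientData(ToyLoader, 2, train=[5, 6], val=None, test=[7]),
}
standalone[1].setup(3)  # rebuild client 1's loaders with a new batch size
global_data = merge_clients(standalone, batch_size=4)
```

The point of the dict-subclass design is that downstream code can treat a client's data as a plain mapping from split name to loader, while setup() remains available to rebuild the loaders when the configuration changes.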

Tools:

  • federatedscope.core.data.BaseDataTranslator

    • BaseDataTranslator converts a torch.utils.data.Dataset or a dict of data splits into a StandaloneDataDict according to cfg. The translated data can be passed directly to FedRunner to launch an FL course.

    • BaseDataTranslator first splits the data into train, val, and test according to cfg.data.splits (the ML split), then uses a Splitter to distribute each split across clients (the FL split). To use BaseDataTranslator, you must specify cfg.data.splitter, cfg.federate.client_num, and any other arguments the Splitter requires.

    Example:

    cfg.data.splitter = 'lda'
    cfg.federate.client_num = 5
    cfg.data.splitter_args = [{'alpha': 0.2}]
    
    # Pass the same cfg that was configured above
    translator = BaseDataTranslator(cfg, DataLoader)
    raw_data = CIFAR10()
    fs_data = translator(raw_data)
    
    runner = FedRunner(data=fs_data,
                       server_class=get_server_cls(cfg),
                       client_class=get_client_cls(cfg),
                       config=cfg.clone())
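
The two-stage split described above (ML split first, then FL split) can be sketched in plain Python as follows. This is a simplified illustration operating on sample indices, not the BaseDataTranslator implementation: ml_split, fl_split, and translate are hypothetical names, the round-robin fl_split is an IID stand-in for a real Splitter, and giving key 0 the full, unsplit data is an assumption based on the description of the server's data above.

```python
import random


def ml_split(indices, splits, seed=0):
    """Shuffle indices and cut them into train/val/test by the given ratios."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n_train = round(splits[0] * len(idx))
    n_val = round(splits[1] * len(idx))
    return {
        "train": idx[:n_train],
        "val": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:],
    }


def fl_split(indices, client_num):
    """Deal indices round-robin to clients 1..client_num (IID stand-in)."""
    return {cid: indices[cid - 1::client_num]
            for cid in range(1, client_num + 1)}


def translate(indices, splits, client_num):
    """ML split first, then FL split of every ML split across clients."""
    parts = ml_split(indices, splits)
    data = {0: parts}  # assumption: the server (key 0) keeps the full splits
    per_split = {name: fl_split(idx, client_num)
                 for name, idx in parts.items()}
    for cid in range(1, client_num + 1):
        data[cid] = {name: per_split[name][cid] for name in parts}
    return data


fs_like = translate(range(100), splits=(0.8, 0.1, 0.1), client_num=5)
```

Each client ends up with its own train, val, and test indices, and the union of the clients' indices for a split equals the server's copy of that split.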
    
  • federatedscope.core.splitters

    • To generate simulated federated datasets, we provide splitters that are responsible for dispersing a given standalone dataset across multiple clients, with configurable statistical heterogeneity among them.

    We provide a look-up table for you to get started with our Splitter:

    | cfg.data.splitter | Domain | Arguments |
    | --- | --- | --- |
    | LDA | Generic | alpha |
    | Louvain | Graph (node-level) | delta |
    | Random | Graph (node-level) | sampling_rate, overlapping_rate, drop_edge |
    | rel_type | Graph (link-level) | alpha |
    | Scaffold | Molecular | - |
    | Scaffold_lda | Molecular | alpha |
    | Rand_chunk | Graph (graph-level) | - |
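
To give a feel for how a splitter's alpha argument controls statistical heterogeneity, the following is a minimal sketch of a Dirichlet-based label split in the spirit of the LDA splitter. It is a simplified illustration under stated assumptions, not the FederatedScope implementation: dirichlet and lda_split are hypothetical names, the Dirichlet draw is built from the standard library's gamma sampler, and details a real splitter may handle (such as enforcing a minimum number of samples per client) are omitted.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k components."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Degenerate draw (possible only for tiny alpha): fall back to uniform.
        return [1.0 / k] * k
    return [d / total for d in draws]


def lda_split(labels, client_num, alpha, seed=0):
    """Assign sample indices to clients, class by class.

    For each class, client proportions are drawn from Dirichlet(alpha);
    a smaller alpha gives more skewed (more heterogeneous) label mixes.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(client_num)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        props = dirichlet(alpha, client_num, rng)
        start, cum = 0, 0.0
        for cid, p in enumerate(props):
            cum += p
            if cid == client_num - 1:
                end = len(idxs)  # last client takes whatever remains
            else:
                end = max(start, min(len(idxs), round(cum * len(idxs))))
            clients[cid].extend(idxs[start:end])
            start = end
    return clients


labels = [i % 3 for i in range(90)]  # toy dataset: 3 balanced classes
parts = lda_split(labels, client_num=5, alpha=0.2)
```

Every sample is assigned to exactly one client. With a small alpha such as 0.2, each class tends to concentrate on a few clients, whereas a large alpha spreads each class more evenly.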